[00:07:20] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[00:11:58] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[00:16:16] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T339178 (phaultfinder)
[00:21:56] SRE, SRE-Access-Requests, Phabricator: phabricator: make jnuche and dancy admins, remove dzahn - https://phabricator.wikimedia.org/T339174 (Dzahn) a:Dzahn
[00:22:17] SRE, SRE-Access-Requests, Phabricator: phabricator: make jnuche and dancy admins, remove dzahn - https://phabricator.wikimedia.org/T339174 (Dzahn) Open→In progress
[00:24:00] (PS2) Eevans: cassandra: do not hardcode monitoring ports [puppet] - https://gerrit.wikimedia.org/r/930262 (https://phabricator.wikimedia.org/T338639)
[00:26:53] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/930262 (https://phabricator.wikimedia.org/T338639) (owner: Eevans)
[00:39:02] (PS3) Eevans: cassandra: do not hardcode monitoring ports [puppet] - https://gerrit.wikimedia.org/r/930262 (https://phabricator.wikimedia.org/T338639)
[00:39:27] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/929761
[00:39:33] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/929761 (owner: TrainBranchBot)
[00:40:28] (PS4) Eevans: cassandra: do not hardcode monitoring ports [puppet] - https://gerrit.wikimedia.org/r/930262 (https://phabricator.wikimedia.org/T338639)
[00:46:41] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/930262 (https://phabricator.wikimedia.org/T338639) (owner: 10Eevans) [00:54:21] (03PS1) 10Jsn.sherman: beta: log click events on Special:Diff|MobileDiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930280 [01:00:18] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/929761 (owner: 10TrainBranchBot) [01:15:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:42:12] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:43:02] 10SRE, 10Continuous-Integration-Infrastructure, 10Traffic: CI failing with "No space left on device" (debian-glue) - https://phabricator.wikimedia.org/T339171 (10Legoktm) [01:43:34] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:50:42] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:15:28] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:27:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:32:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:08] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:35:40] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:01:05] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:07:50] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:10:56] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:11:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:14:02] 10SRE, 10Infrastructure-Foundations, 10Traffic: decide on an aggregation function to combine multiple probes into a single measurement - https://phabricator.wikimedia.org/T337318 (10JameelKaisar) For now we are considering only the 'request_time_ms'. We are taking request time for all the probes/pulses and g... [04:20:14] 10SRE, 10Infrastructure-Foundations, 10Traffic: decide on an aggregation function to combine multiple probes into a single measurement - https://phabricator.wikimedia.org/T337318 (10JameelKaisar) **Probenet Results:** - Belarus (BY) {F37104295} - Czechia (CZ) {F37104297} - Kazakstan (KZ) {F37104299}... [04:21:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:01:06] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:02:38] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:06:07] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! 
:) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928801 (https://phabricator.wikimedia.org/T331036) (owner: 10Marco Fossati) [05:12:07] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929020 (https://phabricator.wikimedia.org/T318436) (owner: 10Lupok) [05:19:51] (03CR) 10Kevin Bazira: [C: 03+1] kserve-inference: refactor the predictor's container settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/930209 (https://phabricator.wikimedia.org/T334583) (owner: 10Elukey) [05:20:22] (03CR) 10Kevin Bazira: [C: 03+1] ml-services: set readinessProbe settings for falcon-7b [deployment-charts] - 10https://gerrit.wikimedia.org/r/930200 (https://phabricator.wikimedia.org/T334583) (owner: 10Elukey) [05:31:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:31:26] (03CR) 10Marostegui: [C: 03+2] db1124,db1125,db1133: Binlog set to SBR [puppet] - 10https://gerrit.wikimedia.org/r/929948 (https://phabricator.wikimedia.org/T322993) (owner: 10Marostegui) [05:33:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1168 to upgrade to 10.6.14 T338918', diff saved to https://phabricator.wikimedia.org/P49430 and previous config saved to /var/cache/conftool/dbconfig/20230615-053318-root.json [05:33:23] T338918: Compile and package 10.6.14 - https://phabricator.wikimedia.org/T338918 [05:33:47] (03PS1) 10Marostegui: db1168: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/930292 [05:34:38] (03CR) 10Marostegui: [C: 03+2] db1168: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/930292 
(owner: Marostegui)
[05:37:54] (PS1) Marostegui: Revert "db1168: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/930307
[05:38:33] (CR) Marostegui: [C: +2] Revert "db1168: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/930307 (owner: Marostegui)
[05:47:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49431 and previous config saved to /var/cache/conftool/dbconfig/20230615-054716-root.json
[05:52:05] (PS1) Jameel Kaisar: Update mappings for some countries based on initial Probenet data [dns] - https://gerrit.wikimedia.org/r/930293 (https://phabricator.wikimedia.org/T337318)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T0600)
[06:00:05] kormat, marostegui, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T0600).
[06:02:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 2%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49432 and previous config saved to /var/cache/conftool/dbconfig/20230615-060220-root.json [06:04:00] (03PS1) 10Muehlenhoff: Record new MOU date for piccardi [puppet] - 10https://gerrit.wikimedia.org/r/930296 [06:06:49] (03CR) 10Muehlenhoff: [C: 03+2] Record new MOU date for piccardi [puppet] - 10https://gerrit.wikimedia.org/r/930296 (owner: 10Muehlenhoff) [06:17:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49433 and previous config saved to /var/cache/conftool/dbconfig/20230615-061725-root.json [06:23:35] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (10phaultfinder) [06:30:00] (03CR) 10Ayounsi: [C: 03+1] "I checked the mappings, +1 there." [dns] - 10https://gerrit.wikimedia.org/r/930293 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [06:31:34] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox interface ID 2066 [06:31:40] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID 2066 [06:32:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49434 and previous config saved to /var/cache/conftool/dbconfig/20230615-063230-root.json [06:32:34] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:39:41] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] Correctly locate firewall type for IDM. 
[puppet] - https://gerrit.wikimedia.org/r/930177 (owner: Slyngshede)
[06:39:52] PROBLEM - puppet last run on analytics1069 is CRITICAL: CRITICAL: Puppet has been disabled for 604926 seconds, message: Journal node is about to be decommissioned thus, swap the journal node with another -T338336 - {USER} - stevemunene, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:47:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49435 and previous config saved to /var/cache/conftool/dbconfig/20230615-064734-root.json
[07:00:05] Amir1, apergos, and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T0700).
[07:00:16] morning!
[07:00:26] today there are no patches scheduled for deployment in the calendar
[07:00:35] likewise, no trainees have signed up for this slot
[07:01:15] Nice and peaceful Thursday morning then
[07:01:24] yep, see everyone next time!
[07:01:59] (03PS1) 10Slyngshede: Keymanagement: Fix squashed migration [software/bitu] - 10https://gerrit.wikimedia.org/r/930512 [07:02:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49436 and previous config saved to /var/cache/conftool/dbconfig/20230615-070239-root.json [07:03:09] (03PS2) 10Jameel Kaisar: Update mappings for some countries based on initial Probenet data [dns] - 10https://gerrit.wikimedia.org/r/930293 (https://phabricator.wikimedia.org/T337318) [07:03:22] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Keymanagement: Fix squashed migration [software/bitu] - 10https://gerrit.wikimedia.org/r/930512 (owner: 10Slyngshede) [07:06:23] (03CR) 10Jameel Kaisar: [C: 03+1] Update mappings for some countries based on initial Probenet data (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/930293 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [07:11:11] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host mw1492.eqiad.wmnet with OS buster [07:11:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host mw1492.eqiad.wmnet with OS buster [07:12:46] (03CR) 10Ayounsi: "Could it be a validator instead, to catch the issue if the interface is created/modified manually too? (or with different automation). 
And" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/930264 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [07:17:10] (03CR) 10Elukey: [C: 03+2] kserve-inference: refactor the predictor's container settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/930209 (https://phabricator.wikimedia.org/T334583) (owner: 10Elukey) [07:17:17] (03CR) 10Elukey: [C: 03+2] ml-services: set readinessProbe settings for falcon-7b [deployment-charts] - 10https://gerrit.wikimedia.org/r/930200 (https://phabricator.wikimedia.org/T334583) (owner: 10Elukey) [07:17:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49437 and previous config saved to /var/cache/conftool/dbconfig/20230615-071744-root.json [07:24:58] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1492.eqiad.wmnet with reason: host reimage [07:27:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1492.eqiad.wmnet with reason: host reimage [07:27:46] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: test_matching_vlan() function crashing in Netbox network report - https://phabricator.wikimedia.org/T339133 (10ayounsi) [07:28:03] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: test_matching_vlan() function crashing in Netbox network report - https://phabricator.wikimedia.org/T339133 (10ayounsi) [07:29:58] (03PS1) 10Elukey: ml-services: add "container" dict in experimental bloom-560m [deployment-charts] - 10https://gerrit.wikimedia.org/r/930514 [07:31:30] (03CR) 10Elukey: [C: 03+2] ml-services: add "container" dict in experimental bloom-560m [deployment-charts] - 10https://gerrit.wikimedia.org/r/930514 (owner: 10Elukey) [07:32:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: Repooling after maintenance', diff saved to 
https://phabricator.wikimedia.org/P49438 and previous config saved to /var/cache/conftool/dbconfig/20230615-073248-root.json [07:34:55] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [07:45:12] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:45:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10elukey) @Papaul thanks! I confirm that it works :) I think that there is only one thing to do, namely update the documentation (https://wikitech.wikimedia.org/wiki/Management_Interfaces#Di... [07:46:20] (03CR) 10Ayounsi: "Overall lgtm, was it tested? Maybe we can compare the execution time to see the improvement?" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/930222 (https://phabricator.wikimedia.org/T321704) (owner: 10Cathal Mooney) [07:46:46] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:49:58] (03CR) 10Btullis: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/929963 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [07:55:56] !log elukey@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1001" [08:00:04] jnuche and jeena: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T0800). 
[08:00:39] morning, I'll roll forward the train in 5m
[08:04:58] (PS1) TrainBranchBot: group2 wikis to 1.41.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/930516 (https://phabricator.wikimedia.org/T337527)
[08:05:00] (CR) TrainBranchBot: [C: +2] group2 wikis to 1.41.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/930516 (https://phabricator.wikimedia.org/T337527) (owner: TrainBranchBot)
[08:05:46] (Merged) jenkins-bot: group2 wikis to 1.41.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/930516 (https://phabricator.wikimedia.org/T337527) (owner: TrainBranchBot)
[08:06:24] (CR) Jelto: [C: +2] miscweb: add transparencyreport release to eqiad and codfw [deployment-charts] - https://gerrit.wikimedia.org/r/930188 (https://phabricator.wikimedia.org/T338781) (owner: Jelto)
[08:07:14] (Merged) jenkins-bot: miscweb: add transparencyreport release to eqiad and codfw [deployment-charts] - https://gerrit.wikimedia.org/r/930188 (https://phabricator.wikimedia.org/T338781) (owner: Jelto)
[08:10:10] !log jelto@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[08:11:19] !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[08:11:37] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[08:11:57] (CR) Vgutierrez: [C: +1] role::cache::{text,upload}: move ulsfo varnishkafkas to PKI [puppet] - https://gerrit.wikimedia.org/r/929963 (https://phabricator.wikimedia.org/T337825) (owner: Elukey)
[08:13:04] !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[08:13:31] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.13 refs
T337527 [08:13:34] T337527: 1.41.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T337527 [08:15:02] !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [08:16:37] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [08:19:27] (03CR) 10Muehlenhoff: [C: 03+2] Add a define to declare an nftables set in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/929315 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:20:20] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5025.eqsin.wmnet [08:20:23] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5017.eqsin.wmnet [08:20:58] !log reboot cp5017 and cp5025 for kernel upgrade (T335835) [08:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:09] (03PS1) 10Jaime Nuche: jenkins: add doc rsync password to releases instance [puppet] - 10https://gerrit.wikimedia.org/r/930517 (https://phabricator.wikimedia.org/T336168) [08:27:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:31:17] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5017.eqsin.wmnet [08:31:24] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5025.eqsin.wmnet [08:31:58] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [08:32:38] (KubernetesAPILatency) resolved: High Kubernetes API latency 
(PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:34:41] (03CR) 10Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/930517/41738/" [puppet] - 10https://gerrit.wikimedia.org/r/930517 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [08:37:19] (03CR) 10Muehlenhoff: [C: 03+1] "Key confirmed via out-of-band validation" [puppet] - 10https://gerrit.wikimedia.org/r/929994 (https://phabricator.wikimedia.org/T336769) (owner: 10Vgutierrez) [08:37:26] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [08:37:33] (03CR) 10Muehlenhoff: [C: 03+1] "Key confirmed via out-of-band validation" [homer/public] - 10https://gerrit.wikimedia.org/r/929998 (https://phabricator.wikimedia.org/T336769) (owner: 10Vgutierrez) [08:37:53] (03CR) 10Vgutierrez: [C: 03+2] admin: Update vgutierrez@yubikey5 key [puppet] - 10https://gerrit.wikimedia.org/r/929994 (https://phabricator.wikimedia.org/T336769) (owner: 10Vgutierrez) [08:42:04] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [08:47:40] (03CR) 10Jelto: "looks mostly good, thanks for moving the bash script out of a erb template. Two comments in line." 
[puppet] - 10https://gerrit.wikimedia.org/r/930182 (owner: 10EoghanGaffney) [08:52:47] !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1001" [08:52:52] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1492.eqiad.wmnet with OS buster [08:52:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host mw1492.eqiad.wmnet with OS buster completed: - mw1492 (**WARN**) - Downtimed on Icinga/Alertm... [08:54:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb1001.eqiad.wmnet [08:57:40] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::cache::{text,upload}: move ulsfo varnishkafkas to PKI [puppet] - 10https://gerrit.wikimedia.org/r/929963 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [08:58:50] RECOVERY - mediawiki-installation DSH group on mw1492 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:59:29] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw1492.eqiad.wmnet [08:59:29] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1492.eqiad.wmnet [09:00:07] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw1492.eqiad.wmnet [09:00:07] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw1492.eqiad.wmnet [09:00:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1001.eqiad.wmnet [09:01:00] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5026.eqsin.wmnet [09:01:01] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5018.eqsin.wmnet [09:01:07] !log reboot cp5018 and cp5026 for 
kernel upgrade (T335835) [09:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:07] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox [09:04:14] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices2004-dev - aborrero@cumin2002" [09:05:06] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [09:05:20] !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices2004-dev - aborrero@cumin2002" [09:05:20] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:05:40] !log move varnishkafka instances in ulsfo to PKI - T337825 [09:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:43] T337825: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 [09:06:46] !log aborrero@cumin2002 START - Cookbook sre.dns.wipe-cache cloudservices2004-dev.codfw.wmnet on all recursors [09:06:48] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudservices2004-dev.codfw.wmnet on all recursors [09:07:56] !log aborrero@cumin2002 START - Cookbook sre.dns.wipe-cache cloudservices2004-dev.mgmt.codfw.wmnet on all recursors [09:07:59] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudservices2004-dev.mgmt.codfw.wmnet on all recursors [09:08:16] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudservices2004-dev.codfw.wmnet with OS bullseye [09:08:32] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudservices2004-dev.codfw.w... [09:12:04] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5026.eqsin.wmnet [09:12:10] (03PS2) 10Arturo Borrero Gonzalez: cloudservices2004-dev: put into service with new setup [puppet] - 10https://gerrit.wikimedia.org/r/930212 (https://phabricator.wikimedia.org/T338778) [09:13:48] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5018.eqsin.wmnet [09:14:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:19:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:24:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices2004-dev: put into service with new setup [puppet] - 10https://gerrit.wikimedia.org/r/930212 (https://phabricator.wikimedia.org/T338778) (owner: 10Arturo Borrero Gonzalez) [09:25:38] jouncebot: nowandnext [09:25:38] For the next 0 hour(s) and 34 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T0800) [09:25:38] In 0 hour(s) and 34 minute(s): Services – Citoid / Zotero 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1000)
[09:25:38] In 0 hour(s) and 34 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1000)
[09:26:37] SRE, ops-codfw, Cloud-VPS, Infrastructure-Foundations, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (aborrero)
[09:29:42] (CR) Muehlenhoff: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/929962 (https://phabricator.wikimedia.org/T264181) (owner: Gehel)
[09:30:20] (PS1) Arturo Borrero Gonzalez: cloudservices2004-dev: fix typo in role assignment [puppet] - https://gerrit.wikimedia.org/r/930523 (https://phabricator.wikimedia.org/T338778)
[09:31:15] (CR) Arturo Borrero Gonzalez: [C: +2] cloudservices2004-dev: fix typo in role assignment [puppet] - https://gerrit.wikimedia.org/r/930523 (https://phabricator.wikimedia.org/T338778) (owner: Arturo Borrero Gonzalez)
[09:34:06] SRE, serviceops, API Platform (RESTbase Deprecation Roadmap), Patch-For-Review: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (Jgiannelos)
[09:34:18] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5027.eqsin.wmnet
[09:34:18] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5019.eqsin.wmnet
[09:34:22] !log reboot cp5019 and cp5027 for kernel upgrade (T335835)
[09:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:53] (PS1) Arturo Borrero Gonzalez: cloudcontrol2004-dev: fix role [puppet] - https://gerrit.wikimedia.org/r/930524 (https://phabricator.wikimedia.org/T338778)
[09:35:10] (PS2) Arturo Borrero Gonzalez: cloudservices2004-dev: fix role [puppet] - https://gerrit.wikimedia.org/r/930524 (https://phabricator.wikimedia.org/T338778)
[09:37:11] (CR) Arturo Borrero Gonzalez:
[C: 03+2] cloudservices2004-dev: fix role [puppet] - 10https://gerrit.wikimedia.org/r/930524 (https://phabricator.wikimedia.org/T338778) (owner: 10Arturo Borrero Gonzalez) [09:39:59] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox [09:41:28] (03CR) 10Cathal Mooney: Validate port block speed combo in server provision script for QFX5120 (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/930264 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [09:41:55] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices2004-dev.private.codfw.wikimedia.cloud - aborrero@cumin2002" [09:42:58] !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices2004-dev.private.codfw.wikimedia.cloud - aborrero@cumin2002" [09:42:58] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:43:15] !log aborrero@cumin2002 START - Cookbook sre.dns.wipe-cache cloudservices2004-dev.private.codfw.wikimedia.cloud on all recursors [09:43:18] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudservices2004-dev.private.codfw.wikimedia.cloud on all recursors [09:47:18] (03PS1) 10Clément Goubert: trafficserver: Send testwiki traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/930547 (https://phabricator.wikimedia.org/T337489) [09:47:24] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5027.eqsin.wmnet [09:48:18] (03PS2) 10Clément Goubert: trafficserver: Send testwiki traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/930547 (https://phabricator.wikimedia.org/T337489) [09:51:14] (03CR) 10Hnowlan: [C: 03+2] handler.images: remove async from poolcounter release (031 comment) [software/thumbor-plugins] - 
10https://gerrit.wikimedia.org/r/930001 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [09:51:47] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cp5019.eqsin.wmnet [09:51:52] (03CR) 10EoghanGaffney: [C: 03+1] jenkins: add doc rsync password to releases instance [puppet] - 10https://gerrit.wikimedia.org/r/930517 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [09:53:05] !log installing openssl security updates on buster [09:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:12] PROBLEM - Check systemd state on cp5019 is CRITICAL: CRITICAL - degraded: The following units failed: haproxy_stek_job.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:54:08] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [09:57:23] !log restarting FPM on mw canaries [09:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:37] (03PS1) 10Elukey: ml-services: tweak readinessProbe settings for falcon-7b [deployment-charts] - 10https://gerrit.wikimedia.org/r/930550 [09:59:56] (03CR) 10Elukey: [C: 03+2] ml-services: tweak readinessProbe settings for falcon-7b [deployment-charts] - 10https://gerrit.wikimedia.org/r/930550 (owner: 10Elukey) [10:00:03] (03Merged) 10jenkins-bot: handler.images: remove async from poolcounter release [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930001 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [10:00:06] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1000). 
[10:00:06] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1000) [10:02:47] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:03:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb1021.eqiad.wmnet with reason: T337961 [10:03:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb1021.eqiad.wmnet with reason: T337961 [10:03:39] T337961: Clean up clouddb1021 - https://phabricator.wikimedia.org/T337961 [10:03:55] (03CR) 10Cathal Mooney: Modify network report to get prefixes for all vlans before checks (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/930222 (https://phabricator.wikimedia.org/T321704) (owner: 10Cathal Mooney) [10:04:31] !log root@clouddb1021.eqiad.wmnet[metawiki]> ALTER TABLE pagelinks ROW_FORMAT=COMPRESSED; (T337961) [10:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:36] !log removed hadoop packages incorrectly labelled for i386 in thirdparty/bigtop15 bullseye-wikimedia [10:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:34] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wdqs-all [10:08:27] (03PS1) 10Muehlenhoff: Provided a dedicated KDC logrotate config and fix service reload [puppet] - 10https://gerrit.wikimedia.org/r/930551 (https://phabricator.wikimedia.org/T337906) [10:08:45] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/930551 (https://phabricator.wikimedia.org/T337906) (owner: 10Muehlenhoff) [10:08:59] (PuppetDisabled) firing: Puppet disabled on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [10:09:52] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:14:49] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling restart_daemons on A:docker-registry [10:14:50] (03PS1) 10Hnowlan: thumbor: fix poolcounter release [deployment-charts] - 10https://gerrit.wikimedia.org/r/930552 [10:15:19] !log klausman@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [10:16:02] !log klausman@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [10:16:43] (03CR) 10Hnowlan: [C: 03+2] thumbor: fix poolcounter release [deployment-charts] - 10https://gerrit.wikimedia.org/r/930552 (owner: 10Hnowlan) [10:17:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling restart_daemons on A:docker-registry [10:17:43] (03Merged) 10jenkins-bot: thumbor: fix poolcounter release [deployment-charts] - 10https://gerrit.wikimedia.org/r/930552 (owner: 10Hnowlan) [10:18:20] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [10:18:32] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [10:19:59] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: pdns: use modern auth server for forward zones [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357) [10:20:31] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [10:20:37] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [10:22:36] RECOVERY - Check systemd state on cp5019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:22:56] !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling restart_daemons on A:aqs-codfw 
[10:23:18] (03CR) 10Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/output/930556/41739/" [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [10:23:38] (03CR) 10EoghanGaffney: [C: 03+2] jenkins: add doc rsync password to releases instance [puppet] - 10https://gerrit.wikimedia.org/r/930517 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [10:23:46] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2161 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/929764 (https://phabricator.wikimedia.org/T339223) [10:30:19] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:30:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wdqs-all [10:30:45] (03PS1) 10Kosta Harlan: Section images: Fix scrolling to placeholder [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930531 (https://phabricator.wikimedia.org/T335209) [10:32:34] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:32:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:32:55] (03PS1) 10Hnowlan: Revert "handler.images: remove async from poolcounter release" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930533 [10:33:48] (03CR) 10Hnowlan: "sigh" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930533 (owner: 10Hnowlan) [10:33:51] 
(03PS2) 10Arturo Borrero Gonzalez: openstack: codfw1dev: pdns: use modern auth server for forward zones [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357) [10:34:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:aqs-codfw [10:34:36] !log klausman@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [10:34:55] !log klausman@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [10:35:07] (03CR) 10Clément Goubert: [C: 03+2] modules: Add preStop sleep and draining to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T338210) (owner: 10Clément Goubert) [10:36:09] (03Merged) 10jenkins-bot: modules: Add preStop sleep and draining to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T338210) (owner: 10Clément Goubert) [10:37:00] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:37:38] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PUT inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:37:54] (03CR) 10Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/output/930556/41740/" [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [10:38:20] (03PS6) 10Clément Goubert: mediawiki: Gracefully handle termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/929706 (https://phabricator.wikimedia.org/T331609) [10:40:03] (03CR) 10Mvolz: [C: 03+1] "The patterns look fine but haven't had a chance to test." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/920710 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [10:41:40] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:44:09] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/929765 [10:51:03] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:52:47] (03PS6) 10Clément Goubert: utils: Simple dblist_to_urllist.py script [puppet] - 10https://gerrit.wikimedia.org/r/923591 [10:54:08] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:54:12] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[10:55:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:56:20] !log fabfur@cumin1001 conftool action : set/pooled=yes; selector: name=cp5019.eqsin.wmnet [10:57:28] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5028.eqsin.wmnet [10:57:29] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5020.eqsin.wmnet [10:57:37] !log reboot cp5020 and cp5028 for kernel upgrade (T335835) [10:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:53] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:58:18] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[11:00:19] (03PS3) 10Arturo Borrero Gonzalez: openstack: codfw1dev: pdns: use modern auth server for forward zones [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357) [11:00:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:00:45] (03CR) 10CI reject: [V: 04-1] openstack: codfw1dev: pdns: use modern auth server for forward zones [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [11:00:47] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/929765 (owner: 10PipelineBot) [11:01:37] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/929765 (owner: 10PipelineBot) [11:01:44] (03CR) 10CI reject: [V: 04-1] Section images: Fix scrolling to placeholder [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930531 (https://phabricator.wikimedia.org/T335209) (owner: 10Kosta Harlan) [11:02:41] (03PS4) 10Arturo Borrero Gonzalez: openstack: codfw1dev: pdns: use modern auth server for forward zones [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357) [11:04:08] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:04:19] (03CR) 10Kosta Harlan: "recheck" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 
10https://gerrit.wikimedia.org/r/930531 (https://phabricator.wikimedia.org/T335209) (owner: 10Kosta Harlan) [11:06:31] (03PS5) 10Arturo Borrero Gonzalez: openstack: codfw1dev: pdns: use modern auth server for forward zones [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357) [11:07:48] PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% [11:07:53] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:08:16] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5020.eqsin.wmnet [11:08:33] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5028.eqsin.wmnet [11:09:17] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/930556/41742/" [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [11:11:59] !log aborrero@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudservices2004-dev.codfw.wmnet with OS bullseye [11:12:09] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudservices2004-dev.codfw.wmnet... 
[11:13:03] jouncebot: nowandnext [11:13:03] No deployments scheduled for the next 1 hour(s) and 46 minute(s) [11:13:03] In 1 hour(s) and 46 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1300) [11:13:04] In 1 hour(s) and 46 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1300) [11:13:07] cooool [11:13:18] (03PS2) 10Ladsgroup: Remove nlwiki from windows-1252 encoding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930192 (https://phabricator.wikimedia.org/T128154) [11:13:21] (03CR) 10Ladsgroup: [C: 03+2] Remove nlwiki from windows-1252 encoding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930192 (https://phabricator.wikimedia.org/T128154) (owner: 10Ladsgroup) [11:13:55] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930192 (https://phabricator.wikimedia.org/T128154) (owner: 10Ladsgroup) [11:14:08] (03Merged) 10jenkins-bot: Remove nlwiki from windows-1252 encoding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930192 (https://phabricator.wikimedia.org/T128154) (owner: 10Ladsgroup) [11:14:39] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:930192|Remove nlwiki from windows-1252 encoding (T128154)]] [11:14:43] T128154: Migrate all old DB rows from windows-1252 to UTF-8 on nlwiki - https://phabricator.wikimedia.org/T128154 [11:16:14] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:930192|Remove nlwiki from windows-1252 encoding (T128154)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [11:17:05] (03PS1) 10Ladsgroup: Switch five large wikis to extlinks read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930560 (https://phabricator.wikimedia.org/T335343) [11:26:14] (03PS6) 10EoghanGaffney: gitlab: Add locking 
to backups [puppet] - 10https://gerrit.wikimedia.org/r/930182 [11:26:23] 11:24:36 ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', '--exclude-wikiversions.php', 'mw2259.codfw.wmnet', 'mw1366.eqiad.wmnet', 'deploy1002.eqiad.wmnet', 'mw1404.eqiad.wmnet', 'mw2289.codfw.wmnet', 'deploy2002.codfw.wmnet', 'mw1398.eqiad.wmnet', 'mw1486.eqiad.wmnet', 'mw1420.eqiad.wmnet', 'mw2300.codfw.wmnet'] (ran as mwdeploy@parse1002.eqiad.wmnet) returned [255]: ssh: connect to host parse1002.eqiad.wmnet [11:26:23] port 22: Connection timed out [11:26:23] (03CR) 10EoghanGaffney: gitlab: Add locking to backups (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/930182 (owner: 10EoghanGaffney) [11:28:23] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41743/console" [puppet] - 10https://gerrit.wikimedia.org/r/930182 (owner: 10EoghanGaffney) [11:28:48] !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling restart_daemons on A:aqs-eqiad [11:29:08] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wcqs-public [11:29:50] (03PS4) 10Samtar: IS: Enable Phonos on 'small' projects, set PhonosInlineAudioPlayerMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930008 (https://phabricator.wikimedia.org/T336763) [11:31:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wcqs-public [11:31:48] I also get a timeout when trying to SSH to parse1002 [11:32:15] (03PS1) 10Kosta Harlan: Section images: update rtl asset with flipped question mark [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930535 (https://phabricator.wikimedia.org/T335207) [11:32:17] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:930192|Remove nlwiki from windows-1252 encoding (T128154)]] (duration: 17m 38s) [11:32:20] T128154: Migrate all old DB rows from 
windows-1252 to UTF-8 on nlwiki - https://phabricator.wikimedia.org/T128154 [11:33:07] claime effie: do you know what's happening? [11:33:14] (parse1002 is unreachable) [11:33:22] Hmm no [11:33:33] it's making scap sad [11:33:49] I'll check [11:33:52] 10SRE, 10Wikimedia-Site-requests, 10serviceops, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Fuzzy) p:05Medium→03High We hit the fan once again with the Israeli [[ https://he.wikisource.org/wiki/פקודת_מס_הכנסה | Income... [11:34:51] tahnks [11:35:25] (03CR) 10Ladsgroup: [C: 03+2] Switch five large wikis to extlinks read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930560 (https://phabricator.wikimedia.org/T335343) (owner: 10Ladsgroup) [11:35:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930560 (https://phabricator.wikimedia.org/T335343) (owner: 10Ladsgroup) [11:36:41] Amir1: No ssh, no console via rac [11:36:57] I'll pool=inactive it so you can proceed and hard reboot it [11:37:02] (03Merged) 10jenkins-bot: Switch five large wikis to extlinks read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930560 (https://phabricator.wikimedia.org/T335343) (owner: 10Ladsgroup) [11:37:12] (03PS3) 10Hnowlan: trafficserver: route proton requests via the API gateway [puppet] - 10https://gerrit.wikimedia.org/r/929674 (https://phabricator.wikimedia.org/T324678) [11:37:15] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:930560|Switch five large wikis to extlinks read new (T335343)]] [11:37:19] T335343: Set externallinks migration stage to read new on beta and production - https://phabricator.wikimedia.org/T335343 [11:37:22] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema [11:37:55] !log cgoubert@cumin1001 conftool action : set/pooled=inactive; 
selector: dc=eqiad,name=parse1002.eqiad.wmnet [11:38:25] Amir1: depooled, tell me if it's enough to quiet scap [11:38:48] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:930560|Switch five large wikis to extlinks read new (T335343)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [11:39:16] !log parse1002 not responding to ssh or console, depooled [11:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema [11:40:09] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on parse1002.eqiad.wmnet with reason: Powercycle [11:40:10] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/930586 [11:40:14] thanks [11:40:22] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on parse1002.eqiad.wmnet with reason: Powercycle [11:40:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:aqs-eqiad [11:43:18] RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [11:43:52] It's back up, Amir1 tell me when you're done with your deployments and I'll scap pull/repool [11:44:17] sure, thanks. 
it'll be done in a minute or two [11:45:19] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-codfw [11:46:26] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:930560|Switch five large wikis to extlinks read new (T335343)]] (duration: 09m 10s) [11:46:29] T335343: Set externallinks migration stage to read new on beta and production - https://phabricator.wikimedia.org/T335343 [11:47:01] (03CR) 10Kamila Součková: [C: 03+1] rest-gateway: add citoid support [deployment-charts] - 10https://gerrit.wikimedia.org/r/920710 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [11:48:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-codfw [11:48:59] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-eqiad [11:49:41] (03CR) 10Kamila Součková: [C: 03+1] Revert "handler.images: remove async from poolcounter release" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930533 (owner: 10Hnowlan) [11:49:53] !log restarting slapd on seagorgium/serpens [11:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:29] Amir1: sorry I am at lunch, can I help ? 
[11:50:29] (03PS5) 10Samtar: IS: Enable Phonos on 'small' projects, set PhonosInlineAudioPlayerMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930008 (https://phabricator.wikimedia.org/T336763) [11:50:36] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] openstack: codfw1dev: pdns: use modern auth server for forward zones [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [11:50:39] effie: all good, it's handled [11:50:42] yup [11:50:46] claime: I'm done [11:50:46] cool thank you claime [11:50:50] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] "merging since this is a NOOP for eqiad1" [puppet] - 10https://gerrit.wikimedia.org/r/930556 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [11:51:02] Amir1: Great, scap pulling on parse1002 and putting it back in the pool [11:51:19] <3 [11:51:31] !log Repooled parse1002.eqiad.wmnet after powercycle [11:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad [11:52:44] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudservices2004-dev.codfw.wmnet with OS bullseye [11:52:56] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudservices2004-dev.codfw.w... 
[11:54:30] (03CR) 10Kamila Součková: [C: 03+1] api-gateway: add device-analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/930214 (https://phabricator.wikimedia.org/T338916) (owner: 10Hnowlan) [11:58:39] !log restarting exim on lists1001 [11:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:48] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1002.eqiad.wmnet [12:02:49] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1002.eqiad.wmnet [12:03:24] (03PS1) 10Stevemunene: analytics: Decommission analytics106[1-3] from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/930580 (https://phabricator.wikimedia.org/T317861) [12:03:31] (03PS1) 10Stevemunene: analytics: Remove analytics106[1-3] from the HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/930581 (https://phabricator.wikimedia.org/T317861) [12:03:37] (03PS1) 10Stevemunene: analytics: Decommission analytics106[4-6] from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/930582 (https://phabricator.wikimedia.org/T317861) [12:03:39] (03PS1) 10Stevemunene: analytics: Remove analytics106[4-6] from the HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/930583 (https://phabricator.wikimedia.org/T317861) [12:03:43] (03PS1) 10Stevemunene: analytics: Decommission analytics106[7-8] from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/930584 (https://phabricator.wikimedia.org/T317861) [12:03:45] (03PS1) 10Stevemunene: analytics: Remove analytics106[7-8] from the HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/930585 (https://phabricator.wikimedia.org/T317861) [12:03:47] (03PS1) 10Stevemunene: analytics: Decommission analytics1069 from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/930606 (https://phabricator.wikimedia.org/T317861) [12:03:49] (03PS1) 10Stevemunene: analytics: Remove analytics1069 from the HDFS topology [puppet] - 
10https://gerrit.wikimedia.org/r/930607 (https://phabricator.wikimedia.org/T317861) [12:05:11] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (10Jclark-ctr) Replaced optic. Cleaned fiber on device side and on pp (port serial 21615538) cable id 5249 [12:06:52] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5029.eqsin.wmnet [12:06:54] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5021.eqsin.wmnet [12:07:08] !log reboot cp5021 and cp5029 for kernel upgrade (T335835) [12:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:09] (03PS1) 10Slyngshede: Keymanagement: Handle MariaDB constraint limitation. [software/bitu] - 10https://gerrit.wikimedia.org/r/930608 [12:11:32] !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices2004-dev.codfw.wmnet with reason: host reimage [12:12:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ArielGlenn) >>! In T334955#8929123, @Papaul wrote: >... The 2 nodes are ready. Thank you Thank you, we'll take 'em! 
:-) [12:13:29] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41744/console" [puppet] - 10https://gerrit.wikimedia.org/r/929713 (https://phabricator.wikimedia.org/T333945) (owner: 10EoghanGaffney) [12:14:01] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices2004-dev.codfw.wmnet with reason: host reimage [12:17:39] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5021.eqsin.wmnet [12:18:08] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5029.eqsin.wmnet [12:18:10] (03PS1) 10AikoChou: changeprop: remove match on specific wiki_id for outlink stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/930610 (https://phabricator.wikimedia.org/T328899) [12:19:35] (03CR) 10Effie Mouzeli: [C: 03+1] mediawiki: Gracefully handle termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/929706 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [12:20:20] (03CR) 10Vgutierrez: [C: 04-1] trafficserver: route proton requests via the API gateway (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929674 (https://phabricator.wikimedia.org/T324678) (owner: 10Hnowlan) [12:27:41] !log installing containerd security updates [12:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:42] (03PS1) 10AikoChou: ml-services: update revert-risk docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/930613 [12:31:48] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/930608 (owner: 10Slyngshede) [12:32:39] (03PS1) 10Samtar: IS: Enable Phonos on test2wiki, set PhonosInlineAudioPlayerMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930614 (https://phabricator.wikimedia.org/T336763) [12:32:45] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] 
Keymanagement: Handle MariaDB constraint limitation. [software/bitu] - 10https://gerrit.wikimedia.org/r/930608 (owner: 10Slyngshede) [12:33:21] (03CR) 10CI reject: [V: 04-1] IS: Enable Phonos on test2wiki, set PhonosInlineAudioPlayerMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930614 (https://phabricator.wikimedia.org/T336763) (owner: 10Samtar) [12:33:48] (03PS1) 10Arturo Borrero Gonzalez: openstack: pdns: recursor: drop IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/930616 (https://phabricator.wikimedia.org/T338778) [12:33:57] (03PS2) 10Samtar: IS: Enable Phonos on test2wiki, set PhonosInlineAudioPlayerMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930614 (https://phabricator.wikimedia.org/T336763) [12:34:45] !log joal@deploy1002 Started deploy [airflow-dags/analytics@d458338]: (no justification provided) [12:34:54] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@d458338]: (no justification provided) (duration: 00m 09s) [12:35:54] !log installing ffmpeg security updates [12:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:57] 10SRE-Sprint-Week-Sustainability-March2023, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10Jclark-ctr) @ayounsi removed 8 cables. 
deleted from netbox [12:37:38] (03PS1) 10Samtar: IS-Labs: Enable Phonos everywhere, set PhonosInlineAudioPlayerMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930617 (https://phabricator.wikimedia.org/T336763) [12:38:24] jouncebot: nowandnext [12:38:24] No deployments scheduled for the next 0 hour(s) and 21 minute(s) [12:38:24] In 0 hour(s) and 21 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1300) [12:38:24] In 0 hour(s) and 21 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1300) [12:40:37] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5030.eqsin.wmnet [12:40:39] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5022.eqsin.wmnet [12:40:47] !log reboot cp5022 and cp5030 for kernel upgrade (T335835) [12:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:04] 10SRE, 10Infrastructure-Foundations, 10netops: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (10ayounsi) [12:41:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930617 (https://phabricator.wikimedia.org/T336763) (owner: 10Samtar) [12:41:10] 10SRE-Sprint-Week-Sustainability-March2023, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10ayounsi) 05Open→03Resolved Awesome, thanks! 
[12:42:11] (03Merged) 10jenkins-bot: IS-Labs: Enable Phonos everywhere, set PhonosInlineAudioPlayerMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930617 (https://phabricator.wikimedia.org/T336763) (owner: 10Samtar) [12:43:08] (03PS2) 10Arturo Borrero Gonzalez: openstack: pdns: recursor: drop IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/930616 (https://phabricator.wikimedia.org/T338778) [12:45:33] !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES eqiad cluster: Roll restart of ORES's daemons. [12:46:04] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/930616/41747/" [puppet] - 10https://gerrit.wikimedia.org/r/930616 (https://phabricator.wikimedia.org/T338778) (owner: 10Arturo Borrero Gonzalez) [12:46:24] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T339168 (10Jclark-ctr) a:03Jclark-ctr [12:47:41] (03PS3) 10Samtar: IS: Enable Phonos on test2wiki, set PhonosInlineAudioPlayerMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930614 (https://phabricator.wikimedia.org/T336763) [12:48:00] !log stevemunene@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons. [12:48:13] (03PS2) 10Samtar: Switch VisualEditor to bypass RESTbase on all wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/929364 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [12:48:16] (03PS2) 10Samtar: beta: log click events on Special:Diff|MobileDiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930280 (owner: 10Jsn.sherman) [12:51:57] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5030.eqsin.wmnet [12:53:26] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5022.eqsin.wmnet [12:53:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:56:10] RECOVERY - Host ps1-a4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [12:57:11] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [12:57:24] (03PS1) 10Btullis: Update the mediawiki_history_reduced sna[pshot to AQS [puppet] - 10https://gerrit.wikimedia.org/r/930620 [12:57:42] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [12:58:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:58:39] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [12:59:07] (03PS4) 10Hokwelum: Modify the global blocks script to override output dir via a command line arg [puppet] - 10https://gerrit.wikimedia.org/r/928861 [12:59:09] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: 
apply [13:00:06] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1300) [13:00:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1300). [13:00:06] duesen, JSherman, and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:06] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: add citoid support [deployment-charts] - 10https://gerrit.wikimedia.org/r/920710 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [13:00:39] present and ready [13:00:43] o/ [13:00:59] (03Merged) 10jenkins-bot: rest-gateway: add citoid support [deployment-charts] - 10https://gerrit.wikimedia.org/r/920710 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [13:01:40] hi, i added one more thing to the window [13:02:39] I can’t deploy yet, sorry (maybe at :30 or so) [13:05:19] * TheresNoTime can deploy [13:05:36] Ah, I was just about to say I can also self-service :) [13:05:51] duesen: feel free, but I don't mind :) [13:05:58] !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES eqiad cluster: Roll restart of ORES's daemons. [13:06:30] I'll need help, but mine is a beta only config change, so hopefully it will be an easy one. [13:06:48] TheresNoTime: I'll do it. [13:07:02] duesen: go ahead, ping me when you're done? [13:07:33] will do [13:07:34] merging now [13:07:38] (03CR) 10Daniel Kinzler: [C: 03+2] Switch VisualEditor to bypass RESTbase on all wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/929364 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [13:07:47] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [13:08:10] !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES codfw cluster: Roll restart of ORES's daemons. [13:08:25] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [13:08:29] (03Merged) 10jenkins-bot: Switch VisualEditor to bypass RESTbase on all wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/929364 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [13:08:54] starting backport [13:09:09] I’m here [13:09:17] !log daniel@deploy1002 Started scap: Backport for [[gerrit:929364|Switch VisualEditor to bypass RESTbase on all wikis. (T320529)]] [13:09:20] T320529: Configure VE backend to use Parsoid directly, instead of calling RESTbase - https://phabricator.wikimedia.org/T320529 [13:09:48] (03CR) 10Samtar: [C: 03+2] "prep for deploy" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930531 (https://phabricator.wikimedia.org/T335209) (owner: 10Kosta Harlan) [13:10:13] (03CR) 10Samtar: [C: 03+2] "prep for deploy" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930535 (https://phabricator.wikimedia.org/T335207) (owner: 10Kosta Harlan) [13:10:19] kostajh: set those merging ^ [13:10:41] !log daniel@deploy1002 daniel: Backport for [[gerrit:929364|Switch VisualEditor to bypass RESTbase on all wikis. (T320529)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:10:46] JSherman: for beta patches, you can just +2 those, no need to put in a deployment window. (AFAIK, someone correct me if I'm wrong please) [13:10:51] hm, I just realized I need to also change this for labs. 
How do I even deploy a config change for labs? [13:11:24] duesen: it will apply automatically to beta [13:11:36] overrides for Labs are in InitialiseSettings-labs.php [13:11:40] TheresNoTime: thank you! [13:13:26] testing on debug looks good. [13:13:36] kostajh: ok thanks. [13:13:37] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5023.eqsin.wmnet [13:13:38] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5031.eqsin.wmnet [13:13:43] !log reboot cp5023 and cp5031 for kernel upgrade (T335835) [13:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:13] kostajh: that's good to know; if one of the deployers can confirm, I'm happy to just +2 it. [13:14:47] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop analytics cluster: Restart of jvm daemons. [13:15:14] ok, syncing [13:16:27] (03CR) 10CDanis: [C: 03+1] "Love it, thanks!" [dns] - 10https://gerrit.wikimedia.org/r/930293 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [13:16:40] JSherman: that's correct for IS-labs/CS-labs, it'll sync to the beta cluster every ~10m once +2'd — iirc it might show as other changes present to the deployer in the next window [13:17:55] (03CR) 10Samtar: [C: 03+2] "deploy, prod no/op" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930280 (owner: 10Jsn.sherman) [13:17:59] TheresNoTime: Ok; I'll go ahead and +2. Thanks! [13:18:12] JSherman: just did :D [13:18:22] :-) [13:18:47] (03Merged) 10jenkins-bot: beta: log click events on Special:Diff|MobileDiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930280 (owner: 10Jsn.sherman) [13:19:16] RECOVERY - MariaDB Replica Lag: s4 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:19:23] Now I know for next time. 
If I do self service a beta config change in the future, should I still wait for a window and hang out here to do it, or is it a whenever thing? [13:19:46] (03PS1) 10Daniel Kinzler: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930626 [13:19:48] (03PS1) 10Daniel Kinzler: Switch VisualEditor to bypass RESTbase on labs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930627 (https://phabricator.wikimedia.org/T320529) [13:20:25] JSherman: I'd still double-check there's nothing going on in here, and maybe just announce you're going to do it? [13:21:06] !log daniel@deploy1002 Finished scap: Backport for [[gerrit:929364|Switch VisualEditor to bypass RESTbase on all wikis. (T320529)]] (duration: 11m 48s) [13:21:10] T320529: Configure VE backend to use Parsoid directly, instead of calling RESTbase - https://phabricator.wikimedia.org/T320529 [13:21:12] TheresNoTime: ack; sounds reasonable. [13:21:21] TheresNoTime: can I deploy this one as well, so labs is in sync? https://gerrit.wikimedia.org/r/930627 Doesn't have to be now, but it shouldn't be out of whack for too long. [13:21:35] ok, config deployed to all prod wikis. [13:21:40] duesen: go ahead :) [13:21:41] Monitoring metrics [13:21:44] (03CR) 10Vgutierrez: [C: 03+1] "awesome work, thanks!" [dns] - 10https://gerrit.wikimedia.org/r/930293 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [13:21:55] duesen: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/930626/1 looks wrong to me [13:22:00] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul) I will be working with @Clement_Goubert today at 10am CT to relocate those mw nodes. [13:22:08] TheresNoTime: Do you think it's ok to deploy it while monitoring metrics to see if we need to revert the first one? 
[13:22:19] (03PS1) 10Herron: thanos-rule: add pyrra filesystem operator output dir to search path [puppet] - 10https://gerrit.wikimedia.org/r/930628 (https://phabricator.wikimedia.org/T302995) [13:22:34] kostajh: ah, right. i messed up the rebase [13:22:43] duesen: wait one, bad rebase (?) yeah [13:23:06] (03PS2) 10Daniel Kinzler: Switch VisualEditor to bypass RESTbase on labs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930627 (https://phabricator.wikimedia.org/T320529) [13:23:19] (03Abandoned) 10Daniel Kinzler: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930626 (owner: 10Daniel Kinzler) [13:23:57] (03CR) 10Elukey: [C: 03+1] analytics: Decommission analytics106[1-3] from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/930580 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [13:24:14] duesen: so far so good? [13:24:16] kostajh: fixed. looks good now? [13:24:24] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5023.eqsin.wmnet [13:24:28] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:24:28] effie: stash access is going up. still looking. [13:24:39] ok cool, I will take a look too [13:24:43] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5031.eqsin.wmnet [13:24:46] kostajh, TheresNoTime: do you think i can merge the patch for labs? [13:25:04] (03PS1) 10Hnowlan: add discovery records for rest-gateway and device-analytics. [dns] - 10https://gerrit.wikimedia.org/r/930631 (https://phabricator.wikimedia.org/T335505) [13:25:17] duesen: can I start the GrowthExperiments backports? 
[13:26:00] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:26:21] (03CR) 10Kosta Harlan: Switch VisualEditor to bypass RESTbase on labs. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930627 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [13:26:48] stash writes doubled, 60 -> 130/sec [13:27:42] kostajh: does it still need merging? code or config? [13:28:13] ...small bump in sql writes... [13:28:31] ...small bump in network utilization on db hosts [13:28:40] !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES codfw cluster: Roll restart of ORES's daemons. [13:29:05] duesen: I need to deploy the patches for GrowthExperiments [13:29:14] they are to wmf.13 [13:29:16] 2c is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/930627 is still a valid config, so can be +2'd if needed. The two GrowthExperiments patches (kostajh) are almost merged [13:29:25] effie: all looking good. stash access seems to stabilize at > 150 per minute [13:29:30] (03CR) 10Hnowlan: [C: 03+2] Revert "handler.images: remove async from poolcounter release" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930533 (owner: 10Hnowlan) [13:30:21] kostajh: sure. can I deploy another config patch while we are waiting for them to merge? [13:31:17] yep [13:31:22] cool [13:31:41] TheresNoTime: will you continue the deployment process for GrowthExperiments patches or do you want me to take over? [13:31:41] ah nice, VE backend transform latency went down by 50% [13:31:53] (I'm joining a meeting so would prefer if you keep moving them forward, if that's alright with you.) [13:32:01] kostajh: I can carry on [13:32:09] ty! [13:32:11] Okay, I verified that my instruments can now push events to that stream. TheresNoTime: and kostajh: thanks!
[13:32:15] duesen: are you wanting to deploy a beta config patch now? [13:32:19] JSherman: ack :) [13:32:47] (03CR) 10Daniel Kinzler: Switch VisualEditor to bypass RESTbase on labs. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930627 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [13:33:02] TheresNoTime: yes. merging. [13:33:05] (03CR) 10Daniel Kinzler: [C: 03+2] Switch VisualEditor to bypass RESTbase on labs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930627 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [13:33:08] ty [13:33:44] (03Merged) 10jenkins-bot: Revert "handler.images: remove async from poolcounter release" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930533 (owner: 10Hnowlan) [13:33:50] (03Merged) 10jenkins-bot: Switch VisualEditor to bypass RESTbase on labs. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930627 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [13:34:30] (03PS1) 10Elukey: ml-services: add more experimental settings for LLMs [deployment-charts] - 10https://gerrit.wikimedia.org/r/930632 (https://phabricator.wikimedia.org/T334583) [13:34:51] * Lucas_WMDE now around [13:34:56] anything left to deploy or all good? [13:35:06] all of the things [13:35:11] oh, actually [13:35:15] Lucas_WMDE: I'm just about to deploy the two GrowthExperiments patches [13:35:19] ok! [13:35:33] (03PS3) 10Reedy: Revert "Temporarily disable UCoC link from non tech wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924567 (https://phabricator.wikimedia.org/T280886) [13:35:42] Lucas_WMDE: ^ if you want to deploy that, I wouldn't complain :) [13:35:53] TheresNoTime: can i run scap on the beta config patch? 
[13:36:12] duesen: https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/448266/console [13:36:17] it's doing it [13:36:41] (03PS4) 10Hnowlan: svg: attempt to build valid locales from hyphenated languages [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/923368 (https://phabricator.wikimedia.org/T337139) [13:37:10] TheresNoTime: oh, now I get what kostajh meant by "automatic". Cool :) [13:37:16] (03CR) 10Hnowlan: svg: attempt to build valid locales from hyphenated languages (032 comments) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/923368 (https://phabricator.wikimedia.org/T337139) (owner: 10Hnowlan) [13:38:34] (03Merged) 10jenkins-bot: Section images: Fix scrolling to placeholder [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930531 (https://phabricator.wikimedia.org/T335209) (owner: 10Kosta Harlan) [13:38:37] (03Merged) 10jenkins-bot: Section images: update rtl asset with flipped question mark [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930535 (https://phabricator.wikimedia.org/T335207) (owner: 10Kosta Harlan) [13:39:15] !log samtar@deploy1002 Started scap: Backport for [[gerrit:930531|Section images: Fix scrolling to placeholder (T335209)]], [[gerrit:930535|Section images: update rtl asset with flipped question mark (T335207)]] [13:39:21] T335209: Section-level images: suggestions mode - https://phabricator.wikimedia.org/T335209 [13:39:21] T335207: Section-level images: onboarding dialog - https://phabricator.wikimedia.org/T335207 [13:39:37] Reedy: sure, once everything else is done [13:40:36] (03CR) 10MVernon: [C: 03+1] "I was slightly thrown by the commit saying we aren't hardcoding the port any more. But it's rather that we're moving it into hiera, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/930262 (https://phabricator.wikimedia.org/T338639) (owner: 10Eevans) [13:40:42] Lucas_WMDE: do you want to take over after these two are done? (just the maintenance script to start, and Re/edy's patch) [13:40:44] !log samtar@deploy1002 kharlan and samtar: Backport for [[gerrit:930531|Section images: Fix scrolling to placeholder (T335209)]], [[gerrit:930535|Section images: update rtl asset with flipped question mark (T335207)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:40:58] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Traffic: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 (10elukey) Next steps: * Roll out the changes to eqsin, and monitor. * Roll out the changes to codfw, and monitor. * Roll out the changes to eqiad, and monitor. * Roll out the ch... [13:40:58] kostajh: both live on mwdebug, can you test? [13:41:51] TheresNoTime: sure, one sec [13:41:57] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] doc: Clean up leftover bits from switch to quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/929713 (https://phabricator.wikimedia.org/T333945) (owner: 10EoghanGaffney) [13:42:40] (03PS1) 10Elukey: role::cache::{text,upload}: move vk instances to PKI in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/930633 (https://phabricator.wikimedia.org/T337825) [13:43:12] TheresNoTime: lgtm [13:43:12] (03CR) 10Ssingh: "recheck" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/930230 (https://phabricator.wikimedia.org/T339134) (owner: 10Ssingh) [13:43:17] syncing [13:43:58] Anyone around for backports? 
[13:44:14] Winston_Sung[m]: me currently, Lucas_WMDE in a moment probably [13:44:54] Here is the requested change to be backported: https://gerrit.wikimedia.org/r/929647 [13:45:08] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/930587 [13:47:02] Lucas_WMDE: y/n on being able to take over? [13:47:08] sure [13:47:20] Winston_Sung[m]: was the idea in https://phabricator.wikimedia.org/T337527#8926660 that the backport should be done before the train reached group2? [13:47:30] because group2 is on wmf.13 now, the train was early today [13:48:04] PROBLEM - Check systemd state on mw1448 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:09] ... [13:48:35] deploy is currently in `php-fpm-restart` [13:48:56] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:930531|Section images: Fix scrolling to placeholder (T335209)]], [[gerrit:930535|Section images: update rtl asset with flipped question mark (T335207)]] (duration: 09m 40s) [13:49:01] T335209: Section-level images: suggestions mode - https://phabricator.wikimedia.org/T335209 [13:49:01] T335207: Section-level images: onboarding dialog - https://phabricator.wikimedia.org/T335207 [13:49:03] kostajh: live [13:49:13] thanks very much! 
[13:49:19] (03CR) 10Hnowlan: [C: 03+2] svg: attempt to build valid locales from hyphenated languages [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/923368 (https://phabricator.wikimedia.org/T337139) (owner: 10Hnowlan) [13:49:29] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5032.eqsin.wmnet [13:49:30] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5024.eqsin.wmnet [13:49:34] Lucas_WMDE: all that's left is MatmaRex's script run, and the two last-minute additions [13:49:43] !log reboot cp5024 and cp5032 for kernel upgrade (T335835) [13:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:17] Reedy: want to remove your -1 from https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/924567 ? [13:51:13] done [13:51:16] “Each of them will probably take a few weeks to complete” ._. [13:51:20] I’ve never run a maint script that long [13:51:26] just, open a tmux session on mwmaint, and let it rip? 
[13:51:43] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/930587 (owner: 10PipelineBot) [13:51:50] !log installing ruby2.5 security updates [13:51:50] yup :D [13:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:55] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924567 (https://phabricator.wikimedia.org/T280886) (owner: 10Reedy) [13:51:58] ok [13:52:00] yes [13:52:06] ~~screen > tmux but ok~~ [13:52:13] i think you can follow what urbanec.m did with the last script [13:52:18] !.kb TheresNoTime [13:52:18] https://phabricator.wikimedia.org/T315510#8929374 [13:52:25] >:D [13:52:31] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/930587 (owner: 10PipelineBot) [13:52:46] (03Merged) 10jenkins-bot: Revert "Temporarily disable UCoC link from non tech wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924567 (https://phabricator.wikimedia.org/T280886) (owner: 10Reedy) [13:52:54] TheresNoTime: The 90s called, they want their terminal multiplexer back <3 [13:53:04] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:924567|Revert "Temporarily disable UCoC link from non tech wikis" (T280886)]] [13:53:07] T280886: Add Code of Conduct link to the Universal Code of Conduct to all non technical wikis - https://phabricator.wikimedia.org/T280886 [13:53:28] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [13:53:33] claime: I just know how it works without having to look anything up D: [13:53:35] (03Merged) 10jenkins-bot: svg: attempt to build valid locales from hyphenated languages [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/923368 (https://phabricator.wikimedia.org/T337139) (owner: 10Hnowlan) [13:53:35] brb 
creating puppet change to uninstall screen /s [13:53:52] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [13:54:02] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [13:54:25] !log lucaswerkmeister-wmde@deploy1002 reedy and lucaswerkmeister-wmde: Backport for [[gerrit:924567|Revert "Temporarily disable UCoC link from non tech wikis" (T280886)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [13:54:33] Lucas_WMDE: Better idea, deploy a global tmuxrc that remaps everything to screen bindings [13:54:35] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [13:54:39] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [13:54:41] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [13:54:46] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [13:54:57] I see a code of conduct link on https://en.wikipedia.org/wiki/Main_Page on mwdebug [13:55:26] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [13:55:30] and a Verhaltenskodex at https://de.wikipedia.org/wiki/Wikipedia:Hauptseite [13:55:41] should be good to go I think [13:55:48] RECOVERY - Check systemd state on mw1448 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:01] "Winston_Sung: was the idea in..." <- Lucas_WMDE: The backport should be done after group 2 to wmf.13. [13:56:02] claime: I actually use C-a instead of C-b for tmux ^^ [13:56:07] (but don’t know any other screen bindings) [13:56:18] Winston_Sung[m]: ok, then now would be the right time [13:56:25] (syncing the UCoC change now) [13:56:48] (03CR) 10BBlack: [C: 03+1] "Amazing work, thank you!" 
[dns] - 10https://gerrit.wikimedia.org/r/930293 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [14:00:15] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5024.eqsin.wmnet [14:00:31] !log remove ruby2.5 2.5.5-3+deb10u5+wmf1 (superseded by corrected Debian build 2.5.5-3+deb10u6 T338294 [14:00:34] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5032.eqsin.wmnet [14:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:35] T338294: ruby2.5 2.5.5-3+deb10u5 breaks Puppet - https://phabricator.wikimedia.org/T338294 [14:01:48] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:924567|Revert "Temporarily disable UCoC link from non tech wikis" (T280886)]] (duration: 08m 44s) [14:01:52] T280886: Add Code of Conduct link to the Universal Code of Conduct to all non technical wikis - https://phabricator.wikimedia.org/T280886 [14:02:04] (03PS3) 10Lucas Werkmeister (WMDE): Revert "Implement Language Converter for yue (Cantonese)" [core] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929647 (https://phabricator.wikimedia.org/T59106) (owner: 10Winston Sung) [14:02:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929647 (https://phabricator.wikimedia.org/T59106) (owner: 10Winston Sung) [14:07:16] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:07:16] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook 
[14:07:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:08:14] Hello! I'm trying to rotate an API key on a mwmaint server. I believe I need to edit PrivateSettings.php, but can't seem to find that file? [14:08:59] (PuppetDisabled) firing: Puppet disabled on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [14:11:56] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:11:56] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:12:53] jkieserman: if you want to update a PS value, you would need to update the canonical copy on deployment.eqiad.wmnet and deploy it with scap, any changes anywhere else will get overwritten [14:12:56] ryankemper, inflatador: should wdqs2021 be silenced? [14:13:39] 10SRE-tools, 10Spicerack: Service without monitor breaks spicerack - https://phabricator.wikimedia.org/T339243 (10Clement_Goubert) [14:16:31] Thanks taavi! A few follow-up questions. (1) we update by sshing into deployment.eqiad.wmnet? (2) where does the PS file live on that server? (3) How do we deploy?
(Sorry, total newb :) ) [14:17:34] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:09] 10SRE-swift-storage, 10serviceops-collab: Investigate object storage for Gitlab - https://phabricator.wikimedia.org/T336234 (10eoghan) Thanks for the feedback! Is there a test cluster that wmcs can connect to that we might be able to use with a test instance of gitlab in order to give it a try before we do thi... [14:19:30] (03Merged) 10jenkins-bot: Revert "Implement Language Converter for yue (Cantonese)" [core] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/929647 (https://phabricator.wikimedia.org/T59106) (owner: 10Winston Sung) [14:19:46] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:929647|Revert "Implement Language Converter for yue (Cantonese)" (T59106 T337527)]] [14:19:51] T59106: Implement LanguageConverter for yue (Cantonese) - https://phabricator.wikimedia.org/T59106 [14:19:51] T337527: 1.41.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T337527 [14:20:22] jkieserman: (1) not sure what you're asking here, sorry (2) /srv/mediawiki-staging/private/ (3) you should presumably find someone with deployment rights and experience, for example show up here during a backport window [14:20:30] ah, they left :/ [14:21:11] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and wsung: Backport for [[gerrit:929647|Revert "Implement Language Converter for yue (Cantonese)" (T59106 T337527)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [14:21:17] Winston_Sung[m]: can you test it? [14:21:34] Testing... [14:22:58] Everything looks fine for me. 
[14:23:06] checking logstash just to be safe [14:23:36] nothing that looks particularly concerning [14:23:37] let’s sync [14:23:43] No console errors, network all HTTP 200. [14:23:56] * HTTP 200 OK. [14:24:03] Yeah. Let's sync. [14:25:11] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] openstack: pdns: recursor: drop IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/930616 (https://phabricator.wikimedia.org/T338778) (owner: 10Arturo Borrero Gonzalez) [14:25:12] (i need to step away for a minute, i hope you can still launch my maintenance. thanks) [14:25:24] MatmaRex: yup, wiil do [14:25:27] *will [14:25:38] (03PS1) 10Eigyan: Remove GDI survey from RU and JA wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930639 (https://phabricator.wikimedia.org/T338926) [14:26:19] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 20 days, 0:00:00 on wdqs2021.codfw.wmnet with reason: attempting WDQS stack on bullseye [14:26:32] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20 days, 0:00:00 on wdqs2021.codfw.wmnet with reason: attempting WDQS stack on bullseye [14:27:26] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:27:43] (03PS1) 10Hnowlan: thumbor: attempt to render hypenated svg languages better [deployment-charts] - 10https://gerrit.wikimedia.org/r/930641 (https://phabricator.wikimedia.org/T337139) [14:29:00] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:29:06] (03CR) 10Hnowlan: [C: 03+2] thumbor: attempt to render hypenated svg languages better [deployment-charts] - 10https://gerrit.wikimedia.org/r/930641 (https://phabricator.wikimedia.org/T337139) (owner: 10Hnowlan) [14:29:39] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:929647|Revert "Implement Language Converter for yue (Cantonese)" (T59106 T337527)]] (duration: 09m 53s) [14:29:40] re maintenance script: from https://phabricator.wikimedia.org/T315510#8716277 and https://phabricator.wikimedia.org/T326314, I’m guessing that I should not start with s1, s4 or s8 [14:29:44] T59106: Implement LanguageConverter for yue (Cantonese) - https://phabricator.wikimedia.org/T59106 [14:29:44] T337527: 1.41.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T337527 [14:29:45] since those three are busy backfilling externallinks [14:29:54] (03Merged) 10jenkins-bot: thumbor: attempt to render hypenated svg languages better [deployment-charts] - 10https://gerrit.wikimedia.org/r/930641 (https://phabricator.wikimedia.org/T337139) (owner: 10Hnowlan) [14:29:56] but maybe I can do s2 and s3 in parallel, for instance [14:30:07] any objections? 
[14:32:46] yes, that sounds reasonable [14:33:01] i also suggested s5 and s6, since the externallinks work is done there as well [14:33:29] and urbanec.m is already running s7, I see [14:34:16] !log Start `foreachwikiindblist 'group2 & s2' DiscussionTools:persistRevisionThreadItems --current --all; touch ~/T315510-s2-exited-$?` in tmux on mwmaint1002 (T315510) [14:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:20] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [14:35:27] !log Start `foreachwikiindblist 'group2 & s3' DiscussionTools:persistRevisionThreadItems --current --all; touch ~/T315510-s3-exited-$?` in tmux on mwmaint1002 (T315510) [14:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:36] (03CR) 10Klausman: changeprop: remove match on specific wiki_id for outlink stream (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930610 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [14:39:03] !log Start `foreachwikiindblist 'group2 & s5' DiscussionTools:persistRevisionThreadItems --current --all; touch ~/T315510-s5-exited-$?` in tmux on mwmaint1002 (T315510) [14:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:31] !log Start `foreachwikiindblist 'group2 & s6' DiscussionTools:persistRevisionThreadItems --current --all; touch ~/T315510-s6-exited-$?` in tmux on mwmaint1002 (T315510) [14:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:34] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [14:40:03] !log UTC afternoon backport+config window done (maintenance script runs are ongoing and “will probably take a few weeks to complete”) [14:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:12] thanks Lucas_WMDE 
[14:40:14] (03PS1) 10Arturo Borrero Gonzalez: openstack: pdns: recursor: make it listen in the right address [puppet] - 10https://gerrit.wikimedia.org/r/930642 (https://phabricator.wikimedia.org/T307357) [14:41:01] (03CR) 10Klausman: [C: 03+1] ml-services: add more experimental settings for LLMs [deployment-charts] - 10https://gerrit.wikimedia.org/r/930632 (https://phabricator.wikimedia.org/T334583) (owner: 10Elukey) [14:43:11] o_O cebwiki and frwiki both have about 11 million rows to update apparently [14:43:31] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T339168 (10Jclark-ctr) 05Open→03Resolved Replaced Managment switch [14:43:40] oh wait, this is DiscussionTools, not Flow [14:43:43] then it makes sense I guess [14:44:15] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/output/930642/41748/" [puppet] - 10https://gerrit.wikimedia.org/r/930642 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [14:47:29] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: ServiceLVS without monitor breaks spicerack - https://phabricator.wikimedia.org/T339243 (10Clement_Goubert) [14:48:02] (03CR) 10Jkieserman: [C: 03+1] Remove GDI survey from RU and JA wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930639 (https://phabricator.wikimedia.org/T338926) (owner: 10Eigyan) [14:50:38] (03CR) 10Stevemunene: [C: 03+2] Use refinery v0.2.16 in refine jobs. 
[puppet] - 10https://gerrit.wikimedia.org/r/928525 (https://phabricator.wikimedia.org/T335308) (owner: 10Snwachukwu) [14:51:49] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mw2401.codfw.wmnet with reason: powering off for T326564 [14:51:53] T326564: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 [14:52:02] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw2401.codfw.wmnet with reason: powering off for T326564 [14:52:07] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mw2411.codfw.wmnet with reason: powering off for T326564 [14:52:20] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw2411.codfw.wmnet with reason: powering off for T326564 [14:52:34] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mw2324.codfw.wmnet with reason: powering off for T326564 [14:52:47] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw2324.codfw.wmnet with reason: powering off for T326564 [14:52:47] 10SRE, 10Continuous-Integration-Infrastructure, 10Traffic: CI failing with "No space left on device" (debian-glue) - https://phabricator.wikimedia.org/T339171 (10hashar) [14:52:52] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mw2323.codfw.wmnet with reason: powering off for T326564 [14:53:15] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw2323.codfw.wmnet with reason: powering off for T326564 [14:53:48] !log Depooling mw2401 mw2411 mw2324 mw2323 as invalid for powerdown - T326564 [14:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:22] !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2401.codfw.wmnet [14:54:39] !log cgoubert@cumin1001 conftool action : 
set/pooled=inactive; selector: name=mw2411.codfw.wmnet [14:54:50] !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2324.codfw.wmnet [14:54:57] !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2323.codfw.wmnet [14:55:18] !log Powering down mw2401 mw2411 mw2324 mw2323 - T326564 [14:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:44] PROBLEM - puppet last run on puppetdb1003 is CRITICAL: CRITICAL: Puppet has been disabled for 604823 seconds, message: testing multi ca support - jbond, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:56:43] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Jhancock.wm) [14:56:59] (PuppetDisabled) firing: Puppet disabled on puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [14:57:18] 10SRE, 10Continuous-Integration-Infrastructure, 10Traffic: CI failing with "No space left on device" (debian-glue) - https://phabricator.wikimedia.org/T339171 (10hashar) a:03hashar That is a recurring issue cause the Jenkins jobs are running on static hosts which are not always entirely cleared up after a... 
[14:58:34] 10SRE, 10Continuous-Integration-Infrastructure, 10Traffic: CI failing with "No space left on device" (debian-glue) - https://phabricator.wikimedia.org/T339171 (10hashar) [15:00:34] (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: codfw1dev: drop IPv6 records for DNS services [dns] - 10https://gerrit.wikimedia.org/r/930647 (https://phabricator.wikimedia.org/T307357) [15:00:36] (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: eqiad1: drop IPv6 records for DNS services [dns] - 10https://gerrit.wikimedia.org/r/930648 (https://phabricator.wikimedia.org/T307357) [15:00:37] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [15:00:49] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [15:01:29] (03CR) 10CI reject: [V: 04-1] wikimediacloud.org: codfw1dev: drop IPv6 records for DNS services [dns] - 10https://gerrit.wikimedia.org/r/930647 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [15:01:46] 10SRE, 10Continuous-Integration-Infrastructure, 10Traffic: CI failing with "No space left on device" (debian-glue) - https://phabricator.wikimedia.org/T339171 (10hashar) > The apt cache overflowing, I don't think it is garbage collected `/srv` is 21G on the instances and: | Disk size in MB | Directory |--|... 
[15:01:51] jouncebot: nowandnext [15:01:51] No deployments scheduled for the next 0 hour(s) and 58 minute(s) [15:01:51] In 0 hour(s) and 58 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1600) [15:01:54] (03CR) 10CI reject: [V: 04-1] wikimediacloud.org: eqiad1: drop IPv6 records for DNS services [dns] - 10https://gerrit.wikimedia.org/r/930648 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [15:03:27] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Gracefully handle termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/929706 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [15:04:06] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10aborrero) [15:04:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10Clement_Goubert) 05Open→03Resolved Host is back in pool, resolving. 
[15:04:30] (03Merged) 10jenkins-bot: mediawiki: Gracefully handle termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/929706 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [15:06:13] (03CR) 10Andrew Bogott: [C: 03+1] dev env: add a basic puppet enc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928667 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:07:57] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [15:09:00] 10SRE, 10Continuous-Integration-Infrastructure: Puppet package_builder module should have a cronjob to clear the apt cache - https://phabricator.wikimedia.org/T339251 (10hashar) [15:09:43] 10SRE, 10Continuous-Integration-Infrastructure, 10Traffic, 10ci-test-error: CI failing with "No space left on device" (debian-glue) - https://phabricator.wikimedia.org/T339171 (10hashar) 05Open→03Resolved I have manually deleted the apt caches which were taking half of the disk space and are never purg... 
[15:09:46] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:09:59] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:10:00] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:10:10] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:10:30] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:10:42] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [15:11:04] ^ the above alert is my fault [15:11:06] fixing [15:11:16] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:30] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:11:34] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:12:12] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:12:28] !log Deploying new mediawiki chart: Gracefully handle termination - T331609 [15:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:32] T331609: Gracefully handle pod termination in mw-on-k8s - https://phabricator.wikimedia.org/T331609 [15:12:36] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [15:13:22] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [15:13:36] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [15:13:40] (03CR) 10Ssingh: "recheck" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/930230 (https://phabricator.wikimedia.org/T339134) (owner: 10Ssingh) [15:14:34] !log cgoubert@deploy1002 helmfile [codfw] [main] START 
helmfile.d/services/mw-jobrunner : sync [15:14:34] !log cgoubert@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [15:14:48] !log cgoubert@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [15:14:58] !log cgoubert@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [15:16:21] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw2401.codfw.wmnet [15:16:21] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2401.codfw.wmnet [15:16:41] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw2401.codfw.wmnet [15:16:52] !log cgoubert@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [15:16:52] !log cgoubert@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [15:17:05] !log cgoubert@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [15:17:11] !log cgoubert@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [15:18:01] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [15:18:38] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [15:20:11] 10SRE, 10Continuous-Integration-Infrastructure: Puppet package_builder module should have a cronjob to clear the apt cache - https://phabricator.wikimedia.org/T339251 (10hashar) pbuilder(8) has an option to clean it automatically: --autocleanaptcache Clean apt cache automatically, to run `apt-get autoc... 
[15:20:51] (03PS1) 10Hashar: package_builder: autoclean apt cache [puppet] - 10https://gerrit.wikimedia.org/r/930653 (https://phabricator.wikimedia.org/T339251) [15:21:16] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [15:21:43] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [15:21:55] !log mw2401.codfw.wmnet repooled following T326564 [15:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:59] T326564: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 [15:22:09] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [15:22:16] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:22:35] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/930653 (https://phabricator.wikimedia.org/T339251) (owner: 10Hashar) [15:23:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:23:24] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:46] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [15:24:20] (03PS1) 10Arturo Borrero Gonzalez: acme_chief: openstack: codfw1dev: allow cloudservices2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/930654 (https://phabricator.wikimedia.org/T307357) [15:24:25] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw2411.codfw.wmnet [15:24:25] 
!log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2411.codfw.wmnet [15:24:48] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [15:25:26] (03CR) 10JHathaway: [C: 03+2] dev env: add a basic puppet enc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928667 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:26:07] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw2411.codfw.wmnet [15:26:14] (03CR) 10Vgutierrez: [C: 04-1] "you cannot issue Let's Encrypt certificates for internal domains (.wmnet TLD)" [puppet] - 10https://gerrit.wikimedia.org/r/930654 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [15:26:57] 10SRE, 10Infrastructure-Foundations, 10Znuny, 10serviceops-collab: VRTS eqiad replacement - https://phabricator.wikimedia.org/T338418 (10Arnoldokoth) @Dzahn Yeah, I created this as a sub-task for that. I will close this first and create another sub-task under (T295416) for decom otrs1001. 
[15:27:21] 10SRE, 10Infrastructure-Foundations, 10Znuny, 10serviceops-collab: upgrade/replace VRTS (formerly ORTS) buster to bullseye - https://phabricator.wikimedia.org/T295416 (10Arnoldokoth) [15:27:24] (03CR) 10Hashar: "PCC https://puppet-compiler.wmflabs.org/output/930653/1965/ gives:" [puppet] - 10https://gerrit.wikimedia.org/r/930653 (https://phabricator.wikimedia.org/T339251) (owner: 10Hashar) [15:27:25] !log mw2411.codfw.wmnet repooled following T326564 [15:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:28] T326564: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 [15:27:57] (03Abandoned) 10Arturo Borrero Gonzalez: acme_chief: openstack: codfw1dev: allow cloudservices2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/930654 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [15:28:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:28:08] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [15:28:14] 10SRE, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: Puppet package_builder module should have a cronjob to clear the apt cache - https://phabricator.wikimedia.org/T339251 (10hashar) a:03hashar [15:28:21] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [15:28:29] 10SRE, 10Infrastructure-Foundations, 10Znuny, 10serviceops-collab: Decom otrs1001 - https://phabricator.wikimedia.org/T339253 (10Arnoldokoth) [15:29:11] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 
(10Papaul) [15:30:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10elukey) 05Resolved→03Open >>! In T338566#8933725, @elukey wrote: > @Papaul thanks! I confirm that it works :) > > I think that there is only one thing to do, namely update the document... [15:30:36] (03CR) 10Hnowlan: [C: 03+1] cassandra: do not hardcode monitoring ports [puppet] - 10https://gerrit.wikimedia.org/r/930262 (https://phabricator.wikimedia.org/T338639) (owner: 10Eevans) [15:31:37] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [15:33:01] (03PS1) 10Muehlenhoff: ferm: Allow passing the port is a more structured way (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/930656 [15:33:09] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [15:33:30] (03CR) 10CI reject: [V: 04-1] ferm: Allow passing the port is a more structured way (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/930656 (owner: 10Muehlenhoff) [15:33:53] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [15:33:57] !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: name=mw2324.codfw.wmnet [15:34:54] (03CR) 10Ssingh: [C: 03+1] "Looks good and thank you for the patch!" [puppet] - 10https://gerrit.wikimedia.org/r/930653 (https://phabricator.wikimedia.org/T339251) (owner: 10Hashar) [15:35:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (10Clement_Goubert) >>! In T338566#8935469, @elukey wrote: >>>! In T338566#8933725, @elukey wrote: >> @Papaul thanks! I confirm that it works :) >> >> I think that there is only one thing to... 
[15:36:36] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw2324.codfw.wmnet [15:36:36] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2324.codfw.wmnet [15:36:45] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw2324.codfw.wmnet [15:37:21] !log milimetric@deploy1002 Started deploy [analytics/refinery@106bf30]: Patch for HiveToDruid with snapshots [15:37:24] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul) [15:37:24] PROBLEM - mediawiki-installation DSH group on mw2324 is CRITICAL: Host mw2324 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:37:36] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: add more experimental settings for LLMs [deployment-charts] - 10https://gerrit.wikimedia.org/r/930632 (https://phabricator.wikimedia.org/T334583) (owner: 10Elukey) [15:37:42] ^that's me, it'll fix itself [15:38:02] (03PS2) 10Muehlenhoff: ferm: Allow passing the port is a more structured way (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/930656 [15:38:29] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul) [15:38:36] 10SRE, 10Wikimedia-Site-requests, 10serviceops, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Dovi) I concur with [[User:Fuzzy]]; a direct solution to this is needed on Hebrew Wikisource. 
[15:38:52] (03CR) 10JHathaway: [C: 03+1] "looks good, one question" [puppet] - 10https://gerrit.wikimedia.org/r/930551 (https://phabricator.wikimedia.org/T337906) (owner: 10Muehlenhoff) [15:39:02] RECOVERY - mediawiki-installation DSH group on mw2324 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:39:04] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul) 05Open→03Resolved This is complete, thanks to @ssingh and @Clement_Goubert [15:39:25] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw2323.codfw.wmnet [15:41:14] (03CR) 10Elukey: changeprop: remove match on specific wiki_id for outlink stream (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930610 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [15:43:30] !log mw2324.codfw.wmnet repooled following T326564 [15:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:34] T326564: codfw: Relocate servers to make space for new switches in rowA and rowB - https://phabricator.wikimedia.org/T326564 [15:44:22] !log milimetric@deploy1002 Finished deploy [analytics/refinery@106bf30]: Patch for HiveToDruid with snapshots (duration: 07m 01s) [15:44:42] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw2323.codfw.wmnet [15:44:43] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2323.codfw.wmnet [15:44:50] !log mw2323.codfw.wmnet repooled following T326564 [15:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:49] (03CR) 10Elukey: [C: 03+2] ml-services: add more experimental settings for LLMs [deployment-charts] - 10https://gerrit.wikimedia.org/r/930632 (https://phabricator.wikimedia.org/T334583) (owner: 10Elukey) [15:45:57] !log milimetric@deploy1002 Started deploy 
[analytics/refinery@106bf30] (thin): Patch for HiveToDruid with snapshots [thin] [15:46:01] !log milimetric@deploy1002 Finished deploy [analytics/refinery@106bf30] (thin): Patch for HiveToDruid with snapshots [thin] (duration: 00m 04s) [15:46:32] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:41] (03CR) 10Muehlenhoff: Provided a dedicated KDC logrotate config and fix service reload (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/930551 (https://phabricator.wikimedia.org/T337906) (owner: 10Muehlenhoff) [15:51:18] !log phabricator - made jnuche (https://phabricator.wikimedia.org/people/manage/32076/) an Administrator T339174 [15:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:22] T339174: phabricator: make jnuche and dancy admins, remove dzahn - https://phabricator.wikimedia.org/T339174 [15:51:57] (03PS5) 10JHathaway: apt: ensure profile::apt is applied before packages. [puppet] - 10https://gerrit.wikimedia.org/r/929787 (https://phabricator.wikimedia.org/T338279) [15:53:00] 10SRE, 10Phabricator: phabricator: make jnuche and dancy admins, remove dzahn - https://phabricator.wikimedia.org/T339174 (10Dzahn) {F37104995} ^ ;) [15:53:06] 10SRE, 10Phabricator: phabricator: make jnuche and dancy admins, remove dzahn - https://phabricator.wikimedia.org/T339174 (10Dzahn) [15:53:45] 10SRE, 10Phabricator: phabricator: make jnuche and dancy admins, remove dzahn - https://phabricator.wikimedia.org/T339174 (10Dzahn) @Aklapper see logs and screenshot above:) can you click for me? 
then this is resolved [15:54:23] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul) [15:54:35] (03CR) 10Ssingh: "This is ready for review and also running on traffic-cache-bullseye:" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/930230 (https://phabricator.wikimedia.org/T339134) (owner: 10Ssingh) [15:55:20] PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:56:13] !log joal@deploy1002 Started deploy [airflow-dags/analytics@c584b62]: (no justification provided) [15:56:25] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@c584b62]: (no justification provided) (duration: 00m 12s) [15:57:13] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul) [15:57:17] (03CR) 10JHathaway: [C: 03+1] Provided a dedicated KDC logrotate config and fix service reload (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/930551 (https://phabricator.wikimedia.org/T337906) (owner: 10Muehlenhoff) [15:57:31] (03CR) 10JHathaway: [C: 03+2] apt: ensure profile::apt is applied before packages. [puppet] - 10https://gerrit.wikimedia.org/r/929787 (https://phabricator.wikimedia.org/T338279) (owner: 10JHathaway) [15:58:39] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:58:43] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [15:59:15] (03PS1) 10Arturo Borrero Gonzalez: acme_chief: openstack: codfw1dev: refresh LDAP certificates [puppet] - 10https://gerrit.wikimedia.org/r/930661 (https://phabricator.wikimedia.org/T307357) [15:59:51] (03CR) 10Vgutierrez: [C: 03+1] "LGTM, thanks for working on this!" 
[debs/trafficserver] - 10https://gerrit.wikimedia.org/r/930230 (https://phabricator.wikimedia.org/T339134) (owner: 10Ssingh) [16:00:05] jbond and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:06:15] (03CR) 10Vgutierrez: [C: 03+1] "looks good, take into account that cloudservices2005-dev.wikimedia.org will lose the current certificate as soon as this gets merged and a" [puppet] - 10https://gerrit.wikimedia.org/r/930661 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [16:06:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] acme_chief: openstack: codfw1dev: refresh LDAP certificates [puppet] - 10https://gerrit.wikimedia.org/r/930661 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [16:10:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:10:32] sigh... [16:10:33] https://letsencrypt.status.io/pages/55957a99e800baa4470002da [16:10:37] ^^ arturo [16:11:11] vgutierrez: hopefully I didn't break it :-^ [16:11:42] nah.. 
but acme-chief isn't happy with Let's Encrypt being down [16:11:51] I can imagine [16:12:14] (03CR) 10Dzahn: [C: 03+2] "I closed https://phabricator.wikimedia.org/T337382 optimistically" [puppet] - 10https://gerrit.wikimedia.org/r/922820 (https://phabricator.wikimedia.org/T337382) (owner: 10Aklapper) [16:12:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad appserver GET/404 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceede [16:12:52] PROBLEM - Check systemd state on acmechief2001 is CRITICAL: CRITICAL - degraded: The following units failed: acme-chief.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:12:58] PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief2001 is CRITICAL: PROCS CRITICAL: 0 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [16:13:10] that's kinda expected sadly :) [16:13:25] PROBLEM - Check unit status of acme-chief #page on acmechief2001 is CRITICAL: CRITICAL: Status of the systemd unit acme-chief https://wikitech.wikimedia.org/wiki/Acme-chief%23Monitoring [16:13:52] * Emperor here from the p.age [16:14:01] nothing to worry about [16:14:13] ack [16:14:18] good-oh :) [16:14:51] acked the page [16:14:56] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.remove-downtime for acmechief2001.codfw.wmnet [16:14:56] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2001.codfw.wmnet [16:15:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:15:10] LOL... wrong cookbook [16:15:56] vgutierrez: how did you know the exact minute I walked away for lunch [16:15:58] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on acmechief2001.codfw.wmnet with reason: https://letsencrypt.status.io/pages/55957a99e800baa4470002da [16:16:00] 😂 [16:16:11] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2001.codfw.wmnet with reason: https://letsencrypt.status.io/pages/55957a99e800baa4470002da [16:17:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad appserver GET/404 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExcee [16:25:58] 10SRE, 10Wikimedia-Site-requests, 10serviceops, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Nahum) The Income Tax Ordinance requires a temporary immediate solution while we continue to ponder the best permanent one. 
[16:26:34] (03CR) 10Ssingh: [C: 03+2] Release 9.2.1-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/930230 (https://phabricator.wikimedia.org/T339134) (owner: 10Ssingh) [16:28:23] (03PS1) 10Hnowlan: images: log key limited by poolcounter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/930664 (https://phabricator.wikimedia.org/T337649) [16:30:49] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10Papaul) [16:34:02] RECOVERY - Check systemd state on vrts2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:30] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T339264 (10phaultfinder) [16:37:30] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:38:39] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on acmechief2001.codfw.wmnet with reason: https://letsencrypt.status.io/pages/55957a99e800baa4470002da [16:38:41] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on acmechief2001.codfw.wmnet with reason: https://letsencrypt.status.io/pages/55957a99e800baa4470002da [16:38:58] refreshed the downtime with a 24h one [16:40:01] vgutierrez: ,3 [16:40:02] <3 [16:41:27] (03CR) 10Btullis: [C: 03+1] role::cache::{text,upload}: move vk instances to PKI in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/930633 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [16:44:24] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:44:31] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 3 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) [16:44:41] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10aborrero) 05Open→03In progress Note: I started to bootstrap the node with instructions from https://wikitech.wikimedia.org/wik... [16:45:50] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:48:07] 10SRE, 10Infrastructure-Foundations, 10Znuny, 10serviceops-collab: upgrade/replace VRTS (formerly OTRS) buster to bullseye - https://phabricator.wikimedia.org/T295416 (10jeremyb-phone) [16:48:52] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10aborrero) Also, `designate-producer` is complaining about something related to rabbitmq, possibly related to the new IP address:... 
[16:50:29] (03CR) 10AikoChou: changeprop: remove match on specific wiki_id for outlink stream (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/930610 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [16:51:02] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin2002" [16:51:24] RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief2001 is OK: PROCS OK: 1 process with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [16:52:02] !log aborrero@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aborrero@cumin2002" [16:52:03] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudservices2004-dev.codfw.wmnet with OS bullseye [16:52:14] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudservices2004-dev.codfw.wmnet... 
[16:52:50] RECOVERY - Check systemd state on acmechief2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:52] !log jnuche@deploy1002 Installing scap version "4.53.0" for 595 hosts [16:55:15] RECOVERY - Check unit status of acme-chief #page on acmechief2001 is OK: OK: Status of the systemd unit acme-chief https://wikitech.wikimedia.org/wiki/Acme-chief%23Monitoring [16:55:45] (03CR) 10Dzahn: [C: 03+1] deployment_server: set user.email and user.name in git config [puppet] - 10https://gerrit.wikimedia.org/r/929400 (https://phabricator.wikimedia.org/T307775) (owner: 10Chad) [16:55:48] !log jnuche@deploy1002 Installing scap version "4.53.0" for 595 hosts [16:59:07] !log jnuche@deploy1002 Installing scap version "4.53.0" for 594 hosts [16:59:34] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, 10Continuous-Integration-Config, 10Release-Engineering-Team (Seen): Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532 (10hashar) 05Open→03Resolved a:03Legoktm That was implemented by @... [17:00:04] !log jnuche@deploy1002 Installation of scap version "4.53.0" completed for 594 hosts [17:00:05] bd808: It is that lovely time of the day again! You are hereby commanded to deploy Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1700). [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1700) [17:00:54] I should have a developer-portal version to deploy today I think... 
/me looks [17:01:26] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, 10Continuous-Integration-Config, 10Release-Engineering-Team (Seen): Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532 (10hashar) (I think that task was left open to have the list of hosts pa... [17:02:15] !log joal@deploy1002 Started deploy [airflow-dags/analytics@bba655e]: (no justification provided) [17:02:27] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@bba655e]: (no justification provided) (duration: 00m 11s) [17:04:48] (03PS1) 10BryanDavis: developer-portal: Bump container version to 2023-06-15-114340-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/930667 [17:06:14] (03CR) 10Krinkle: [C: 03+1] Update mappings for some countries based on initial Probenet data [dns] - 10https://gerrit.wikimedia.org/r/930293 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [17:06:54] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:07:20] 10SRE, 10LDAP-Access-Requests: LDAP/Gerrit: replace Daniel Z with Jelto in gerritadmins - https://phabricator.wikimedia.org/T339161 (10Dzahn) also see T218686 (Create Gerrit Administrator right policy) [17:07:25] 10SRE, 10LDAP-Access-Requests, 10Release-Engineering-Team (Radar): Grant Access to gerritadmin for Aklapper - https://phabricator.wikimedia.org/T339173 (10Dzahn) also see T218686 (Create Gerrit Administrator right policy) [17:07:43] 10SRE, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: Puppet package_builder module should have the apt cache auto cleaned - https://phabricator.wikimedia.org/T339251 (10hashar) [17:07:50] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container version to 2023-06-15-114340-production [deployment-charts] - 
10https://gerrit.wikimedia.org/r/930667 (owner: 10BryanDavis) [17:08:10] 10SRE, 10Gerrit, 10serviceops-collab, 10Release-Engineering-Team (Seen): Create Gerrit Administrator right policy - https://phabricator.wikimedia.org/T218686 (10Dzahn) Priority was set to low. Just came up once again though with linked requests. [17:08:38] (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2023-06-15-114340-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/930667 (owner: 10BryanDavis) [17:09:29] 10SRE, 10LDAP-Access-Requests, 10Release-Engineering-Team (Radar): Grant Access to gerritadmin for Aklapper - https://phabricator.wikimedia.org/T339173 (10Dzahn) +1 to adding Andre, for sure. clinic duty can resolve this like other LDAP group requests. [17:10:05] 10SRE, 10LDAP-Access-Requests: LDAP/Gerrit: replace Daniel Z with Jelto in gerritadmins - https://phabricator.wikimedia.org/T339161 (10Dzahn) clinic duty can resolve this like other LDAP access requests [17:10:35] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/output/929400/41751/deploy1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/929400 (https://phabricator.wikimedia.org/T307775) (owner: 10Chad) [17:11:49] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:12:11] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:12:23] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:12:52] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:13:01] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:13:35] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:15:04] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM 
GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:16:17] (03PS1) 10Bartosz Dziewoński: HelpCompletionTool wasn't added to extension.json [extensions/VisualEditor] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930541 (https://phabricator.wikimedia.org/T338254) [17:17:01] 10SRE, 10Wikimedia-Site-requests, 10serviceops, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Aklapper) p:05High→03Medium [[ https://www.mediawiki.org/wiki/Phabricator/Project_management#Setting_task_priorities | The Pr... [17:19:28] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:20:12] 10SRE, 10LDAP-Access-Requests, 10Release-Engineering-Team (Radar): Grant Access to gerritadmin for Aklapper - https://phabricator.wikimedia.org/T339173 (10hashar) That follows up @Aklapper joining #releng which is owning the #gerrit service. @thcipriani is the team manager thus I guess him filing the task se... [17:20:34] (03PS1) 10Joal: Move spark_jobs from spark2 to spark3 [puppet] - 10https://gerrit.wikimedia.org/r/930669 [17:21:33] 10SRE, 10LDAP-Access-Requests: LDAP/Gerrit: replace Daniel Z with Jelto in gerritadmins - https://phabricator.wikimedia.org/T339161 (10hashar) Gerrit Administrators are managed via LDAP `gerritadmin` LDAP group. Thank you to have filed the task which is nice for history purposes. I think it is pretty much sel... 
[17:22:01] 10SRE, 10Gerrit, 10LDAP-Access-Requests: LDAP/Gerrit: replace Daniel Z with Jelto in gerritadmins - https://phabricator.wikimedia.org/T339161 (10hashar) [17:22:07] 10SRE, 10Gerrit, 10LDAP-Access-Requests, 10Release-Engineering-Team (Radar): Grant Access to gerritadmin for Aklapper - https://phabricator.wikimedia.org/T339173 (10hashar) [17:32:45] (03PS3) 10Hokwelum: Fix up more things in the README for testing new dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/928605 (https://phabricator.wikimedia.org/T325232) [17:32:47] (03PS5) 10Hokwelum: Modify the global blocks script to accept output dir [puppet] - 10https://gerrit.wikimedia.org/r/928861 [17:32:49] (03PS1) 10Hokwelum: make snapshot101[67] temporary testbed hosts [puppet] - 10https://gerrit.wikimedia.org/r/930671 [17:43:33] 10SRE, 10Phabricator: phabricator: make jnuche and dancy admins, remove dzahn - https://phabricator.wikimedia.org/T339174 (10taavi) 05In progress→03Resolved I'm not Andre but done. [17:45:19] (03PS2) 10Snwachukwu: Move spark_jobs from spark2 to spark3 [puppet] - 10https://gerrit.wikimedia.org/r/930669 (owner: 10Joal) [17:52:31] (03PS1) 10Ladsgroup: BlockedDomains: Add logging in case of hit [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930542 (https://phabricator.wikimedia.org/T337431) [17:52:41] jouncebot: nowandnext [17:52:42] For the next 0 hour(s) and 7 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1700) [17:52:42] For the next 0 hour(s) and 7 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1700) [17:52:42] In 0 hour(s) and 7 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1800) [17:53:02] (03CR) 10Ladsgroup: [C: 03+2] BlockedDomains: Add logging in case of 
hit [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930542 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup) [17:54:56] (03PS1) 10Ladsgroup: Enable blocked domain list in testwiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930672 (https://phabricator.wikimedia.org/T337431) [17:57:36] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:58:50] (03PS1) 10Hashar: zuul: replace zuul-gearman.py by gearman-tools [puppet] - 10https://gerrit.wikimedia.org/r/930673 (https://phabricator.wikimedia.org/T339172) [17:59:04] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:59:10] (03PS2) 10Hashar: zuul: replace zuul-gearman.py by gearman-tools [puppet] - 10https://gerrit.wikimedia.org/r/930673 (https://phabricator.wikimedia.org/T339172) [18:00:06] jnuche and jeena: (Dis)respected human, time to deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T1800). Please do the needful. 
[18:01:37] (03PS3) 10Snwachukwu: Move spark_jobs from spark2 to spark3 [puppet] - 10https://gerrit.wikimedia.org/r/930669 (https://phabricator.wikimedia.org/T335308) (owner: 10Joal) [18:08:40] (03PS1) 10Andrew Bogott: magnum: Have podman limit size of logs [puppet] - 10https://gerrit.wikimedia.org/r/930674 (https://phabricator.wikimedia.org/T336586) [18:08:59] (PuppetDisabled) firing: Puppet disabled on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [18:09:04] (03CR) 10CI reject: [V: 04-1] magnum: Have podman limit size of logs [puppet] - 10https://gerrit.wikimedia.org/r/930674 (https://phabricator.wikimedia.org/T336586) (owner: 10Andrew Bogott) [18:09:10] (03CR) 10CI reject: [V: 04-1] BlockedDomains: Add logging in case of hit [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930542 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup) [18:10:16] (03PS2) 10Andrew Bogott: magnum: Have podman limit size of logs [puppet] - 10https://gerrit.wikimedia.org/r/930674 (https://phabricator.wikimedia.org/T336586) [18:12:00] (03Merged) 10jenkins-bot: BlockedDomains: Add logging in case of hit [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930542 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup) [18:13:09] (03PS3) 10Andrew Bogott: magnum: Have podman limit size of logs [puppet] - 10https://gerrit.wikimedia.org/r/930674 (https://phabricator.wikimedia.org/T336586) [18:13:37] (03CR) 10CI reject: [V: 04-1] magnum: Have podman limit size of logs [puppet] - 10https://gerrit.wikimedia.org/r/930674 (https://phabricator.wikimedia.org/T336586) (owner: 10Andrew Bogott) [18:13:39] (03PS4) 10Andrew Bogott: magnum: Have podman limit size of logs [puppet] - 10https://gerrit.wikimedia.org/r/930674 
(https://phabricator.wikimedia.org/T336586) [18:13:56] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:930542|BlockedDomains: Add logging in case of hit (T337431)]] [18:14:00] T337431: Rework MediaWiki:SpamBlacklist - https://phabricator.wikimedia.org/T337431 [18:14:50] (03CR) 10Andrew Bogott: "The puppet manifest that applies the patch is a nightmare but can you check the diffs to make sure I didn't miss a line and/or reverse the" [puppet] - 10https://gerrit.wikimedia.org/r/930674 (https://phabricator.wikimedia.org/T336586) (owner: 10Andrew Bogott) [18:17:34] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:23:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:25:12] sigh, I lost connection to deploy1002 [18:25:58] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:930542|BlockedDomains: Add logging in case of hit (T337431)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [18:26:02] T337431: Rework MediaWiki:SpamBlacklist - https://phabricator.wikimedia.org/T337431 [18:28:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:35:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST replicasets) on k8s@codfw - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:36:02] claime: 18:35:30 ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', '--exclude-wikiversions.php', 'mw2289.codfw.wmnet', 'mw2300.codfw.wmnet', 'mw1398.eqiad.wmnet', 'mw2259.codfw.wmnet', 'mw1420.eqiad.wmnet', 'mw1486.eqiad.wmnet', 'mw1404.eqiad.wmnet', 'mw1366.eqiad.wmnet', 'deploy1002.eqiad.wmnet', 'deploy2002.codfw.wmnet'] (ran as mwdeploy@parse1002.eqiad.wmnet) returned [255]: ssh: connect to host [18:36:03] parse1002.eqiad.wmnet port 22: Connection timed out [18:36:10] I honestly think this needs a hw check [18:36:56] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:37:46] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:37:58] 10SRE, 10Gerrit, 10LDAP-Access-Requests: LDAP/Gerrit: replace Daniel Z with Jelto in gerritadmins - https://phabricator.wikimedia.org/T339161 (10Dzahn) Nah, it's not self-service for SRE. At least not anymore since a certain incident in the past, when sre was specifically removed from gerritadmins and that's... 
[18:38:28] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:40:34] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:42:50] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:44:18] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:44:29] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:930542|BlockedDomains: Add logging in case of hit (T337431)]] (duration: 30m 33s) [18:44:33] T337431: Rework MediaWiki:SpamBlacklist - https://phabricator.wikimedia.org/T337431 [18:44:52] !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2021.* [18:45:06] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:45:38] (03PS1) 10Gmodena: mw-page-content-change-enrich: HA in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/930675 (https://phabricator.wikimedia.org/T338233) [18:46:13] (03PS1) 10Andrew Bogott: Heat and Magnum: include service token with subcalls [puppet] - 10https://gerrit.wikimedia.org/r/930676 (https://phabricator.wikimedia.org/T333874) [18:47:28] (03CR) 10Ladsgroup: [C: 03+2] Enable blocked domain list in testwiki and fawiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/930672 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup) [18:48:10] !log [WDQS] `ryankemper@wdqs2012:~$ sudo pool` [18:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:18] (03Merged) 10jenkins-bot: Enable blocked domain list in testwiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930672 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup) [18:48:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930672 (https://phabricator.wikimedia.org/T337431) (owner: 10Ladsgroup) [18:48:32] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:930672|Enable blocked domain list in testwiki and fawiki (T337431)]] [18:48:42] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:48:44] (03CR) 10Andrew Bogott: [C: 03+2] Heat and Magnum: include service token with subcalls [puppet] - 10https://gerrit.wikimedia.org/r/930676 (https://phabricator.wikimedia.org/T333874) (owner: 10Andrew Bogott) [18:49:09] (03CR) 10CDanis: [C: 03+2] Update mappings for some countries based on initial Probenet data [dns] - 10https://gerrit.wikimedia.org/r/930293 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [18:49:32] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:50:07] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:930672|Enable blocked domain list in testwiki and fawiki (T337431)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, 
mwdebug2001.codfw.wmnet [18:50:11] T337431: Rework MediaWiki:SpamBlacklist - https://phabricator.wikimedia.org/T337431 [18:51:17] (03PS2) 10Gmodena: mw-page-content-change-enrich: HA in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/930675 (https://phabricator.wikimedia.org/T338233) [18:52:13] (03CR) 10Ryan Kemper: [C: 03+1] query_service: migrate WDQS to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/929962 (https://phabricator.wikimedia.org/T264181) (owner: 10Gehel) [18:53:01] (03CR) 10Ryan Kemper: [C: 03+2] query_service: migrate WDQS to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/929962 (https://phabricator.wikimedia.org/T264181) (owner: 10Gehel) [18:53:21] (03CR) 10Gehel: [C: 04-1] "multiple issues according to PCC, I'll check back once the parent CR is merged." [puppet] - 10https://gerrit.wikimedia.org/r/930191 (owner: 10Gehel) [18:55:08] (03PS3) 10Gmodena: mw-page-content-change-enrich: HA in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/930675 (https://phabricator.wikimedia.org/T338233) [18:56:28] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:56:59] (PuppetDisabled) firing: Puppet disabled on puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [18:57:56] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:00:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:02:33] (03CR) 10Eevans: [C: 03+2] cassandra: do not hardcode monitoring ports [puppet] - 10https://gerrit.wikimedia.org/r/930262 (https://phabricator.wikimedia.org/T338639) (owner: 10Eevans) [19:03:00] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:05:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:05:46] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.070 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:06:13] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:930672|Enable blocked domain list in testwiki and fawiki (T337431)]] (duration: 17m 40s) [19:06:17] T337431: Rework MediaWiki:SpamBlacklist - https://phabricator.wikimedia.org/T337431 [19:08:56] (03PS1) 10Milimetric: Revert "Revert "Bump mediawiki_history_reduced version for aqs"" [puppet] - 10https://gerrit.wikimedia.org/r/930543 [19:09:08] (03PS2) 10Milimetric: Revert "Revert "Bump mediawiki_history_reduced version for aqs"" [puppet] - 10https://gerrit.wikimedia.org/r/930543 [19:13:22] (03PS1) 10Gehel: query_service: fix logging configuration for wdqs updater [puppet] - 10https://gerrit.wikimedia.org/r/930678 [19:15:58] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, 
dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:16:45] (03CR) 10Ryan Kemper: [C: 03+1] query_service: fix logging configuration for wdqs updater [puppet] - 10https://gerrit.wikimedia.org/r/930678 (owner: 10Gehel) [19:16:47] (03CR) 10Ryan Kemper: [C: 03+2] query_service: fix logging configuration for wdqs updater [puppet] - 10https://gerrit.wikimedia.org/r/930678 (owner: 10Gehel) [19:19:21] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (10Jhancock.wm) [19:20:24] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:23:30] (03PS2) 10Btullis: Update the mediawiki_history_reduced snapshot to AQS [puppet] - 10https://gerrit.wikimedia.org/r/930620 [19:25:22] (03CR) 10Btullis: [C: 03+2] Update the mediawiki_history_reduced snapshot to AQS [puppet] - 10https://gerrit.wikimedia.org/r/930620 (owner: 10Btullis) [19:25:40] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:28:38] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:31:31] (03CR) 10EllenR: [C: 03+1] "you got a +2, but I'll add my 2 cents" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930639 (https://phabricator.wikimedia.org/T338926) (owner: 10Eigyan) [19:32:03] (03PS1) 10Andrew Bogott: neutron policy: policy rules to permit members to create magnum clusters [puppet] - 10https://gerrit.wikimedia.org/r/930681 (https://phabricator.wikimedia.org/T333874) [19:33:13] (03CR) 10Andrew Bogott: [C: 03+2] neutron policy: policy rules to permit members to create magnum clusters [puppet] - 10https://gerrit.wikimedia.org/r/930681 (https://phabricator.wikimedia.org/T333874) (owner: 10Andrew Bogott) [19:38:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-b1-codfw.mgmt.codfw.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [19:39:07] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T339264 (10Jclark-ctr) a:03Jclark-ctr [19:40:41] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:41:51] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:45:48] (03PS1) 10Superpes15: [uzwiki] Add the 'patroller' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930682 (https://phabricator.wikimedia.org/T338826) [19:46:37] (03CR) 10CI reject: [V: 04-1] [uzwiki] Add the 'patroller' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930682 (https://phabricator.wikimedia.org/T338826) (owner: 10Superpes15) [19:47:18] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T339178 (10Jhancock.wm) tested connection. can ssh into the management port. resolve. [19:47:57] (03PS2) 10Superpes15: [uzwiki] Add the 'patroller' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930682 (https://phabricator.wikimedia.org/T338826) [19:48:47] (03PS3) 10Superpes15: [uzwiki] Add the 'patroller' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930682 (https://phabricator.wikimedia.org/T338826) [19:49:28] (03CR) 10CI reject: [V: 04-1] [uzwiki] Add the 'patroller' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930682 (https://phabricator.wikimedia.org/T338826) (owner: 10Superpes15) [19:52:07] Uhm "Unexpected ';', expecting ']' in ./wmf-config/core-Permissions.php on line 5536" [19:53:33] oops [19:53:34] :p [19:53:57] you seem to be accidentally removing the closing bracket for the `eliminator` group [19:54:35] Uhm do you mean in line 4263? [19:54:51] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:55:27] I just fixed the indentation... 
[19:55:29] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:56:25] no, 2583 [19:56:42] Oh [19:57:43] (03PS4) 10Superpes15: [uzwiki] Add the 'patroller' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930682 (https://phabricator.wikimedia.org/T338826) [19:58:14] Greetings All! [19:58:48] 10SRE, 10Traffic, 10Wikipedia-Android-App-Backlog, 10Wikipedia-iOS-App-Backlog: Integrate In-App Internet censorship circumvention by domain fronting - https://phabricator.wikimedia.org/T327286 (10ZauberViolino) Is the Wikipedia app is available on Apple's App Store? (My iPad region is US so I cannot check... [19:59:28] Lol fixed thanks taavi didn't see it at all :D [19:59:43] (03Abandoned) 10Milimetric: Revert "Revert "Bump mediawiki_history_reduced version for aqs"" [puppet] - 10https://gerrit.wikimedia.org/r/930543 (owner: 10Milimetric) [19:59:46] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:00:07] brennen and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230615T2000). [20:00:07] eigyan, MatmaRex, and Superpes: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:23] hi [20:00:34] o/ [20:00:59] * TheresNoTime looks for brennen [20:01:37] I can do this if you need someone to fill in TheresNoTime [20:01:54] thcipriani: if you wouldn't mind, thank you :) [20:02:01] no problem, on it [20:02:29] (03PS2) 10Thcipriani: Remove GDI survey from RU and JA wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930639 (https://phabricator.wikimedia.org/T338926) (owner: 10Eigyan) [20:02:57] eigyan: I'll start with your [20:02:58] s [20:03:10] Many thanks thcipriani [20:03:24] * thcipriani fumbles in window manager [20:04:23] i think i'll need to add another change to the window, i'm preparing a revert [20:04:40] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:05:19] MatmaRex: k [20:06:00] checking into some logspam real quick before starting, sorry for delay [20:08:47] (03PS1) 10Bartosz Dziewoński: Revert "Targets: Use align:'after' instead of actionGroups" [extensions/VisualEditor] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930545 (https://phabricator.wikimedia.org/T339292) [20:09:03] Uhm Can't test my patch anymore [20:09:23] If someone can.. otherwise I should schedule it next week! [20:10:31] Superpes: what do you mean? 
[20:10:54] RECOVERY - Host ps1-c6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.16 ms [20:11:53] (03CR) 10Ottomata: [C: 03+1] mw-page-content-change-enrich: HA in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/930675 (https://phabricator.wikimedia.org/T338233) (owner: 10Gmodena) [20:12:00] ok, going ahead, sorry for getting distracted by errors :P [20:12:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930639 (https://phabricator.wikimedia.org/T338926) (owner: 10Eigyan) [20:13:25] (03Merged) 10jenkins-bot: Remove GDI survey from RU and JA wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930639 (https://phabricator.wikimedia.org/T338926) (owner: 10Eigyan) [20:13:26] Superpes: the change looks straightforward to me, i think i can verify it once deployed :) [20:13:40] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:930639|Remove GDI survey from RU and JA wikis. (T338926)]] [20:13:44] T338926: Undeploy Community Safety Survey from RU and JA Wikipedias (est. on or after June 14th) - https://phabricator.wikimedia.org/T338926 [20:14:18] (03CR) 10Bartosz Dziewoński: [C: 03+1] [uzwiki] Add the 'patroller' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930682 (https://phabricator.wikimedia.org/T338826) (owner: 10Superpes15) [20:15:03] (03PS1) 10Andrew Bogott: neutron policy: more policy rule changes to support our shared network [puppet] - 10https://gerrit.wikimedia.org/r/930683 (https://phabricator.wikimedia.org/T333874) [20:15:13] !log thcipriani@deploy1002 essexigyan and thcipriani: Backport for [[gerrit:930639|Remove GDI survey from RU and JA wikis. 
(T338926)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:15:57] ^ eigyan on mwdebug, check please [20:16:13] Excellent checking now [20:16:18] thank you [20:17:44] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T339264 (10Jclark-ctr) Rebooted Msw [20:18:30] All is well thcipriani thank you for all you do! [20:18:43] great! going live everywhere now [20:18:51] Suhweeet! [20:20:54] (03CR) 10Andrew Bogott: [C: 03+2] neutron policy: more policy rule changes to support our shared network [puppet] - 10https://gerrit.wikimedia.org/r/930683 (https://phabricator.wikimedia.org/T333874) (owner: 10Andrew Bogott) [20:22:41] (03CR) 10Ottomata: [C: 03+2] Move spark_jobs from spark2 to spark3 [puppet] - 10https://gerrit.wikimedia.org/r/930669 (https://phabricator.wikimedia.org/T335308) (owner: 10Joal) [20:23:39] > ssh: connect to host parse1002.eqiad.wmnet port 22: Connection timed out [20:23:41] hrmmmm [20:23:55] is that known? /me checks sal [20:25:21] doesn't look like anything is happening with it that's been logged [20:25:38] (03PS1) 10Ottomata: Remove reference to absent ::druid_load classes [puppet] - 10https://gerrit.wikimedia.org/r/930684 (https://phabricator.wikimedia.org/T335308) [20:26:09] (03CR) 10Ottomata: [C: 03+2] Remove reference to absent ::druid_load classes [puppet] - 10https://gerrit.wikimedia.org/r/930684 (https://phabricator.wikimedia.org/T335308) (owner: 10Ottomata) [20:26:34] thcipriani: we failed to install scap to that this morning as well [20:27:10] jaime said it happens sometimes and the changes weren't relevant to it anyway [20:27:31] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1149'] [20:27:53] jeena: oh, thanks for the note. That'd be nice to fix.
Have to wait for timeouts :( [20:30:10] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:930639|Remove GDI survey from RU and JA wikis. (T338926)]] (duration: 16m 30s) [20:30:14] T338926: Undeploy Community Safety Survey from RU and JA Wikipedias (est. on or after June 14th) - https://phabricator.wikimedia.org/T338926 [20:30:22] and, yeah, see it got downtimed earlier today for the same reason [20:30:31] ^ eigyan should be live everywhere [20:30:56] I'll have a look thcipriani [20:31:08] thanks [20:31:15] MatmaRex: you're up [20:31:23] yup [20:31:31] (03CR) 10Thcipriani: [C: 03+2] HelpCompletionTool wasn't added to extension.json [extensions/VisualEditor] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930541 (https://phabricator.wikimedia.org/T338254) (owner: 10Bartosz Dziewoński) [20:31:37] (03CR) 10Thcipriani: [C: 03+2] Revert "Targets: Use align:'after' instead of actionGroups" [extensions/VisualEditor] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930545 (https://phabricator.wikimedia.org/T339292) (owner: 10Bartosz Dziewoński) [20:31:49] and sorry I should have been backporting these the whole time [20:31:58] er...should have +2'd them [20:32:02] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1149'] [20:32:27] I'll jump ahead to Superpes while we wait for jenkins [20:33:21] Superpes: are you ready for deploy for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/930682/ ? 
[20:36:26] thcipriani: i think they said they had to leave, but i can verify that change [20:36:43] oh, ok, thanks MatmaRex going ahead [20:37:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930682 (https://phabricator.wikimedia.org/T338826) (owner: 10Superpes15) [20:38:25] (03Merged) 10jenkins-bot: [uzwiki] Add the 'patroller' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930682 (https://phabricator.wikimedia.org/T338826) (owner: 10Superpes15) [20:38:40] Thank you thcipriani all is well signing off for now... [20:38:41] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:930682|[uzwiki] Add the 'patroller' usergroup (T338826)]] [20:38:46] T338826: Request to activate patroller user group on uzwiki - https://phabricator.wikimedia.org/T338826 [20:40:03] !log thcipriani@deploy1002 superpes and thcipriani: Backport for [[gerrit:930682|[uzwiki] Add the 'patroller' usergroup (T338826)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:40:56] ^ MatmaRex on mwdebug, check please [20:41:04] eigyan: thank you, see ya [20:42:05] thcipriani: looks good, i see the group at https://uz.wikipedia.org/wiki/Maxsus:ListGroupRights as expected [20:42:33] cool, thank you for volunteering as tribute, going live [20:42:39] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1149'] [20:42:47] (03PS1) 10Ottomata: refine - Use trailing / for schema base uris [puppet] - 10https://gerrit.wikimedia.org/r/930706 (https://phabricator.wikimedia.org/T335308) [20:45:20] heh [20:46:39] (03CR) 10Ottomata: [C: 03+2] refine - Use trailing / for schema base uris [puppet] - 10https://gerrit.wikimedia.org/r/930706 (https://phabricator.wikimedia.org/T335308) (owner: 10Ottomata) [20:48:08] (03PS1) 10Ottomata: refine_test - Use trailing / for 
schema base uris [puppet] - 10https://gerrit.wikimedia.org/r/930708 (https://phabricator.wikimedia.org/T335308) [20:50:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1149'] [20:51:11] * thcipriani waits on timeouts for parse1002... [20:52:23] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1150'] [20:53:40] (03Merged) 10jenkins-bot: HelpCompletionTool wasn't added to extension.json [extensions/VisualEditor] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930541 (https://phabricator.wikimedia.org/T338254) (owner: 10Bartosz Dziewoński) [20:53:43] (03Merged) 10jenkins-bot: Revert "Targets: Use align:'after' instead of actionGroups" [extensions/VisualEditor] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/930545 (https://phabricator.wikimedia.org/T339292) (owner: 10Bartosz Dziewoński) [20:54:09] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:930682|[uzwiki] Add the 'patroller' usergroup (T338826)]] (duration: 15m 27s) [20:54:12] T338826: Request to activate patroller user group on uzwiki - https://phabricator.wikimedia.org/T338826 [20:54:33] ^ MatmaRex Superpes should be live everywhere now [20:55:17] thanks [20:55:25] MatmaRex: any harm deploying both of these at the same time? [20:55:30] (03CR) 10Cwhite: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/929749 (https://phabricator.wikimedia.org/T335610) (owner: 10Cwhite) [20:55:36] thcipriani: no, that should be okay [20:55:52] cool, I'll do that [20:57:21] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:930545|Revert "Targets: Use align:'after' instead of actionGroups" (T339292)]], [[gerrit:930541|HelpCompletionTool wasn't added to extension.json (T338254)]] [20:57:25] T339292: Issues with gadgets adding tools to VisualEditor "Page options" dropdown (ve.init.Target.actionGroups[1] is undefined) - https://phabricator.wikimedia.org/T339292 [20:57:26] T338254: Expose toolbar search feature in toolbar itself - https://phabricator.wikimedia.org/T338254 [20:58:45] !log thcipriani@deploy1002 thcipriani and matmarex: Backport for [[gerrit:930545|Revert "Targets: Use align:'after' instead of actionGroups" (T339292)]], [[gerrit:930541|HelpCompletionTool wasn't added to extension.json (T338254)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:59:00] ^ should be on mwdebug, check please [20:59:09] looking [20:59:42] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Tue 20 Jun 2023 04:41:39 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:01:07] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade cassandra-dev cluster to bullseye - https://phabricator.wikimedia.org/T339304 (10Eevans) [21:01:08] thcipriani: both changes look good [21:01:14] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:01:24] MatmaRex: okie doke, going live everywhere [21:01:58] Decommission cassandra-a, cassandra-dev2001 — T339304 [21:01:58] T339304: Upgrade cassandra-dev cluster to bullseye - https://phabricator.wikimedia.org/T339304 [21:02:14] (03Abandoned) 10Cathal Mooney: Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [21:03:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm) [21:08:29] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host cassandra-dev2001.codfw.wmnet with OS bullseye [21:08:36] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade cassandra-dev cluster to bullseye - https://phabricator.wikimedia.org/T339304 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host cassandra-dev2001.codfw.wmnet with OS bullseye [21:11:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1150'] [21:11:51] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1150'] [21:12:03] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1150'] [21:12:36] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1150'] [21:12:48] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1150'] [21:13:30] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:930545|Revert "Targets: Use align:'after' instead of actionGroups" (T339292)]], 
[[gerrit:930541|HelpCompletionTool wasn't added to extension.json (T338254)]] (duration: 16m 09s) [21:13:35] T339292: Issues with gadgets adding tools to VisualEditor "Page options" dropdown (ve.init.Target.actionGroups[1] is undefined) - https://phabricator.wikimedia.org/T339292 [21:13:35] T338254: Expose toolbar search feature in toolbar itself - https://phabricator.wikimedia.org/T338254 [21:13:41] ^ alright MatmaRex all done [21:13:48] thanks thcipriani [21:13:52] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1150'] [21:14:02] !log parse1002 having ssh connection problems during backport window [21:14:03] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1150'] [21:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:12] thanks for all the checking MatmaRex o/ [21:17:14] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1150'] [21:17:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1150'] [21:19:01] !log jhancock@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1150'] [21:19:10] !log jhancock@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1150'] [21:21:11] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1150'] [21:21:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1150'] [21:24:16] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2001.codfw.wmnet with reason: host reimage [21:26:42] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 2:00:00 on cassandra-dev2001.codfw.wmnet with reason: host reimage [21:28:24] Thanks thcipriani and MatmaRex :) [21:28:34] (03PS1) 10Phedenskog: Remove oversampling for Navigation Timing extension. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930712 (https://phabricator.wikimedia.org/T337858) [21:28:52] Unfortunately I had a sudden commitment! [21:29:05] Superpes: it happens, thanks for the patch [21:29:57] (03PS1) 10Ottomata: refine & spark_job - parameterize spark_submit executable path and use spark2 for RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/930713 (https://phabricator.wikimedia.org/T335308) [21:30:15] (03CR) 10Ottomata: [C: 03+2] refine_test - Use trailing / for schema base uris [puppet] - 10https://gerrit.wikimedia.org/r/930708 (https://phabricator.wikimedia.org/T335308) (owner: 10Ottomata) [21:30:32] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1151'] [21:30:44] (03CR) 10CI reject: [V: 04-1] refine & spark_job - parameterize spark_submit executable path and use spark2 for RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/930713 (https://phabricator.wikimedia.org/T335308) (owner: 10Ottomata) [21:31:14] (03PS2) 10Ottomata: refine & spark_job - parameterize spark_submit path and use spark2 for RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/930713 (https://phabricator.wikimedia.org/T335308) [21:31:47] (03CR) 10CI reject: [V: 04-1] refine & spark_job - parameterize spark_submit path and use spark2 for RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/930713 (https://phabricator.wikimedia.org/T335308) (owner: 10Ottomata) [21:34:33] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability: Investigate missing WikibaseQualityConstraints logs in logstash. - https://phabricator.wikimedia.org/T214031 (10colewhite) Might be related to how MediaWiki logging is configured? Some messages get through like jobrunner and some messa... 
[21:34:43] (03PS3) 10Ottomata: refine - parameterize spark_submit and use spark2 for RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/930713 (https://phabricator.wikimedia.org/T335308) [21:35:37] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41757/console" [puppet] - 10https://gerrit.wikimedia.org/r/930713 (https://phabricator.wikimedia.org/T335308) (owner: 10Ottomata) [21:36:58] (03CR) 10Ottomata: [V: 03+1 C: 03+2] refine - parameterize spark_submit and use spark2 for RefineSanitize [puppet] - 10https://gerrit.wikimedia.org/r/930713 (https://phabricator.wikimedia.org/T335308) (owner: 10Ottomata) [21:39:47] (03CR) 10Krinkle: [C: 03+1] "LGTM. Can be deployed any time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930712 (https://phabricator.wikimedia.org/T337858) (owner: 10Phedenskog) [21:40:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1151'] [21:40:35] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1152'] [21:41:53] (03PS1) 10Ottomata: refine_sanitize - Fix typo in spark_submit path [puppet] - 10https://gerrit.wikimedia.org/r/930714 (https://phabricator.wikimedia.org/T335308) [21:42:05] (03CR) 10Ottomata: [V: 03+2 C: 03+2] refine_sanitize - Fix typo in spark_submit path [puppet] - 10https://gerrit.wikimedia.org/r/930714 (https://phabricator.wikimedia.org/T335308) (owner: 10Ottomata) [21:50:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1152'] [21:59:47] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2001.codfw.wmnet with OS bullseye [21:59:52] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade cassandra-dev cluster to bullseye - https://phabricator.wikimedia.org/T339304 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host cassandra-dev2001.codfw.wmnet with OS bullseye completed: - cassan... [22:01:25] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1153'] [22:08:59] (PuppetDisabled) firing: Puppet disabled on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [22:12:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1153'] [22:13:13] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade cassandra-dev cluster to bullseye - https://phabricator.wikimedia.org/T339304 (10Eevans) p:05Triage→03Medium [22:14:01] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1154'] [22:14:06] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet [22:14:07] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [22:14:28] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host cassandra-dev2002.codfw.wmnet with OS bullseye [22:14:36] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade cassandra-dev cluster to bullseye - https://phabricator.wikimedia.org/T339304 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host cassandra-dev2002.codfw.wmnet with OS bullseye [22:17:34] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:18:02] !log pt1979@cumin2002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - pt1979@cumin2002" [22:21:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - pt1979@cumin2002" [22:21:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:28:08] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10dancy) [22:28:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1154'] [22:28:57] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1155'] [22:30:16] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage [22:32:21] (03PS1) 10Cwhite: backport orchestrator fields from ECS 8.8 [software/ecs] - 10https://gerrit.wikimedia.org/r/930597 (https://phabricator.wikimedia.org/T292881) [22:33:15] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage [22:38:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1155'] [22:38:58] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1156'] [22:40:56] (03PS1) 10EoghanGaffney: registry: Add nginx logs to rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/930719 (https://phabricator.wikimedia.org/T322579) [22:43:28] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41758/console" [puppet] - 
10https://gerrit.wikimedia.org/r/930719 (https://phabricator.wikimedia.org/T322579) (owner: 10EoghanGaffney) [22:46:19] (03PS1) 10Cathal Mooney: Modify Juniper ZTP shell script to use ed25519 keyword [puppet] - 10https://gerrit.wikimedia.org/r/930720 (https://phabricator.wikimedia.org/T336485) [22:49:36] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1156'] [22:49:42] (03CR) 10Papaul: [V: 03+1] Modify Juniper ZTP shell script to use ed25519 keyword [puppet] - 10https://gerrit.wikimedia.org/r/930720 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [22:50:09] (03CR) 10Cathal Mooney: [C: 03+2] Modify Juniper ZTP shell script to use ed25519 keyword [puppet] - 10https://gerrit.wikimedia.org/r/930720 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [22:52:03] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1156'] [22:53:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1156'] [22:54:18] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1156'] [22:56:59] (PuppetDisabled) firing: Puppet disabled on puppetdb1003:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=puppet&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [22:57:28] (03PS1) 10Papaul: Add an-worker11[49-56] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/930724 (https://phabricator.wikimedia.org/T327295) [23:00:13] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T339100 (10wiki_willy) a:03Jclark-ctr [23:01:20] 10SRE, 10ops-eqiad, 10DC-Ops: Relabel: 
puppetserver1005 to puppetserver1001 - https://phabricator.wikimedia.org/T338326 (wiki_willy) a:Jclark-ctr [23:02:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1156'] [23:02:50] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1155'] [23:03:19] (CR) Papaul: [C: +2] Add an-worker11[49-56] to site.pp and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/930724 (https://phabricator.wikimedia.org/T327295) (owner: Papaul) [23:07:16] (PS1) Cathal Mooney: Allow MGMT ranges to make TFTP requests to install server [puppet] - https://gerrit.wikimedia.org/r/930727 (https://phabricator.wikimedia.org/T336485) [23:08:49] ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T339264 (Jclark-ctr) Open→Resolved link restored on servers in C6 [23:09:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1155'] [23:10:35] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1154'] [23:10:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1154'] [23:12:20] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1153'] [23:16:02] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2002.codfw.wmnet with OS bullseye [23:16:09] SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade cassandra-dev cluster to bullseye - https://phabricator.wikimedia.org/T339304 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host cassandra-dev2002.codfw.wmnet with OS bullseye completed: - cassan... 
[23:20:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1153'] [23:20:25] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1152'] [23:20:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1152'] [23:21:08] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1151'] [23:21:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1151'] [23:23:57] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1154'] [23:24:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1154'] [23:26:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device ssw1-a1-codfw.mgmt.codfw.wmnet [23:30:30] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1154'] [23:30:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1154'] [23:31:36] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1150'] [23:31:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1150'] [23:37:00] ops-codfw, Infrastructure-Foundations, netops: codfw:basic spines/leaves configuration using ZTP - https://phabricator.wikimedia.org/T339315 (Papaul) [23:37:22] ops-codfw, Infrastructure-Foundations, netops: codfw:basic spines/leaves configuration using ZTP - https://phabricator.wikimedia.org/T339315 (Papaul) 
p:Triage→Medium [23:38:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-b1-codfw.mgmt.codfw.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [23:39:14] SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade cassandra-dev cluster to bullseye - https://phabricator.wikimedia.org/T339304 (Eevans) [23:42:28] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host cassandra-dev2003.codfw.wmnet with OS bullseye [23:42:34] SRE, Cassandra, Infrastructure-Foundations, Epic: Upgrade cassandra-dev cluster to bullseye - https://phabricator.wikimedia.org/T339304 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host cassandra-dev2003.codfw.wmnet with OS bullseye [23:42:48] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1150'] [23:43:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1150'] [23:43:58] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1153'] [23:44:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1153'] [23:44:56] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1153'] [23:44:58] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['an-worker1153'] [23:45:58] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-a1-codfw.mgmt.codfw.wmnet [23:45:59] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [23:46:12] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts 
['an-worker1150'] [23:46:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1150'] [23:47:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:47:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-a1-codfw.mgmt.codfw.wmnet [23:51:47] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-a2-codfw.mgmt.codfw.wmnet [23:51:49] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [23:52:27] ops-codfw, Infrastructure-Foundations, netops: codfw:basic spines/leaves configuration using ZTP - https://phabricator.wikimedia.org/T339315 (Papaul) [23:54:33] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a2-codfw - pt1979@cumin2002" [23:55:20] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:55:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a2-codfw - pt1979@cumin2002" [23:55:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:55:35] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1149'] [23:56:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-a2-codfw.mgmt.codfw.wmnet [23:56:38] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:57:24] !log jhancock@cumin2002 END 
(FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1149'] [23:58:39] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2003.codfw.wmnet with reason: host reimage