[00:01:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:02:06] (PS4) Dzahn: cloudweb2002-dev (labtestwikitech): purge mediawiki font packages [puppet] - https://gerrit.wikimedia.org/r/736927 (https://phabricator.wikimedia.org/T294378)
[00:04:10] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/32148/" [puppet] - https://gerrit.wikimedia.org/r/736927 (https://phabricator.wikimedia.org/T294378) (owner: Dzahn)
[00:04:41] (CR) Dzahn: [C: +2] cloudweb2002-dev (labtestwikitech): purge mediawiki font packages [puppet] - https://gerrit.wikimedia.org/r/736927 (https://phabricator.wikimedia.org/T294378) (owner: Dzahn)
[00:04:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:39] !log https://labtestwikitech.wikimedia.org - purging mediawiki font packages from backend server
[00:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:52] (CR) Dzahn: "alright, you can now look at https://labtestwikitech.wikimedia.org/wiki/Main_Page and see an actual wiki on a server without the font pack" [puppet] - https://gerrit.wikimedia.org/r/735042 (https://phabricator.wikimedia.org/T294378) (owner: Dzahn)
[00:12:58] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:15:04] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:16:02] RECOVERY - Check systemd state on phab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:08] !log phab1001 - sudo systemctl start phabricator_clean_tmp_files.service because Icinga alerted it had failed... worked fine
[00:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:00] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:22:22] SRE: Renice application server services - https://phabricator.wikimedia.org/T79395 (Dzahn)
[00:29:00] (CR) BryanDavis: [C: +1] wikitech::web: remove font packages from wikitech servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/735042 (https://phabricator.wikimedia.org/T294378) (owner: Dzahn)
[00:35:30] (CR) Dzahn: "cool, thanks for confirming. let me merge it tomorrow morning, gotta step away from keyboard for now" [puppet] - https://gerrit.wikimedia.org/r/735042 (https://phabricator.wikimedia.org/T294378) (owner: Dzahn)
[00:45:30] SRE, Commons, Tools, Wikimedia-Mailing-lists: Character encoding issues on daily-image-l - https://phabricator.wikimedia.org/T295096 (Legoktm) Seems to be an issue specifically with hyperkitty (the archives), the actual email looks fine.
[00:58:53] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[01:00:00] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[03:03:16] PROBLEM - Host mr1-drmrs is DOWN: PING CRITICAL - Packet loss = 100%
[03:05:32] PROBLEM - Host mr1-drmrs IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[03:05:33] PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[03:08:40] RECOVERY - Host mr1-drmrs is UP: PING OK - Packet loss = 0%, RTA = 85.58 ms
[03:08:52] PROBLEM - Host mr1-drmrs.oob is DOWN: PING CRITICAL - Packet loss = 100%
[03:11:40] RECOVERY - Host mr1-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.58 ms
[03:14:56] PROBLEM - Router interfaces on mr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.130, interfaces up: 34, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:25:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:58:53] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[05:24:34] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:24:48] (PS1) Marostegui: dbproxy1019: Depool clouddb1016 [puppet] - https://gerrit.wikimedia.org/r/736945 (https://phabricator.wikimedia.org/T290868)
[05:29:21] (CR) Marostegui: [C: +2] dbproxy1019: Depool clouddb1016 [puppet] - https://gerrit.wikimedia.org/r/736945 (https://phabricator.wikimedia.org/T290868) (owner: Marostegui)
[05:31:04] !log Upgrade clouddb1020
[05:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:25] !log Upgrade clouddb1016
[05:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:06] (PS1) Marostegui: Revert "dbproxy1019: Depool clouddb1016" [puppet] - https://gerrit.wikimedia.org/r/736831
[05:35:50] (CR) Marostegui: [C: +2] Revert "dbproxy1019: Depool clouddb1016" [puppet] - https://gerrit.wikimedia.org/r/736831 (owner: Marostegui)
[06:12:26] (PS1) Marostegui: dbbackups: Switch s8 backup generation from db1116 to db1171 [puppet] - https://gerrit.wikimedia.org/r/736946 (https://phabricator.wikimedia.org/T290868)
[06:43:44] (CR) Elukey: [C: +2] admin: reduce privileges for ml-team-admins [puppet] - https://gerrit.wikimedia.org/r/736852 (owner: Elukey)
[06:43:52] (PS2) Elukey: admin: reduce privileges for ml-team-admins [puppet] - https://gerrit.wikimedia.org/r/736852
[06:52:38] (PS4) Elukey: role::statistics::explorer: add ml profile [puppet] - https://gerrit.wikimedia.org/r/736848 (https://phabricator.wikimedia.org/T280467)
[06:53:48] (PS5) Giuseppe Lavagetto: php: allow installing multiple php versions at the same time [puppet] - https://gerrit.wikimedia.org/r/736276 (https://phabricator.wikimedia.org/T293450)
[06:53:50] (PS1) Giuseppe Lavagetto: profile::mediawiki::php: Allow running multiple php versions in parallel [puppet] - https://gerrit.wikimedia.org/r/736948 (https://phabricator.wikimedia.org/T293450)
[06:53:52] (PS1) Giuseppe Lavagetto: mediawiki::php: support multiple php version in monitoring too [puppet] - https://gerrit.wikimedia.org/r/736949
[06:54:32] (CR) jerkins-bot: [V: -1] php: allow installing multiple php versions at the same time [puppet] - https://gerrit.wikimedia.org/r/736276 (https://phabricator.wikimedia.org/T293450) (owner: Giuseppe Lavagetto)
[06:55:57] (CR) jerkins-bot: [V: -1] profile::mediawiki::php: Allow running multiple php versions in parallel [puppet] - https://gerrit.wikimedia.org/r/736948 (https://phabricator.wikimedia.org/T293450) (owner: Giuseppe Lavagetto)
[06:56:21] (CR) jerkins-bot: [V: -1] mediawiki::php: support multiple php version in monitoring too [puppet] - https://gerrit.wikimedia.org/r/736949 (owner: Giuseppe Lavagetto)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211105T0700)
[07:02:15] (PS2) Giuseppe Lavagetto: profile::mediawiki::php: Allow running multiple php versions in parallel [puppet] - https://gerrit.wikimedia.org/r/736948 (https://phabricator.wikimedia.org/T293450)
[07:02:17] (PS2) Giuseppe Lavagetto: mediawiki::php: support multiple php version in monitoring too [puppet] - https://gerrit.wikimedia.org/r/736949
[07:04:20] (CR) jerkins-bot: [V: -1] profile::mediawiki::php: Allow running multiple php versions in parallel [puppet] - https://gerrit.wikimedia.org/r/736948 (https://phabricator.wikimedia.org/T293450) (owner: Giuseppe Lavagetto)
[07:04:39] (CR) jerkins-bot: [V: -1] mediawiki::php: support multiple php version in monitoring too [puppet] - https://gerrit.wikimedia.org/r/736949 (owner: Giuseppe Lavagetto)
[07:04:46] (PS3) Giuseppe Lavagetto: profile::mediawiki::php: Allow running multiple php versions in parallel [puppet] - https://gerrit.wikimedia.org/r/736948 (https://phabricator.wikimedia.org/T293450)
[07:04:48] (PS3) Giuseppe Lavagetto: mediawiki::php: support multiple php version in monitoring too [puppet] - https://gerrit.wikimedia.org/r/736949
[07:06:38] (CR) jerkins-bot: [V: -1] profile::mediawiki::php: Allow running multiple php versions in parallel [puppet] - https://gerrit.wikimedia.org/r/736948 (https://phabricator.wikimedia.org/T293450) (owner: Giuseppe Lavagetto)
[07:07:12] (CR) jerkins-bot: [V: -1] mediawiki::php: support multiple php version in monitoring too [puppet] - https://gerrit.wikimedia.org/r/736949 (owner: Giuseppe Lavagetto)
[07:15:59] SRE, ops-ulsfo: ps1-22-ulsfo Cord, Master_Cord_A, Active Power alerting - https://phabricator.wikimedia.org/T294891 (ayounsi) I set it to 2000 so it stops alerting.
[07:19:40] (PS1) Giuseppe Lavagetto: fixup [puppet] - https://gerrit.wikimedia.org/r/736980
[07:21:05] (PS4) Giuseppe Lavagetto: profile::mediawiki::php: Allow running multiple php versions in parallel [puppet] - https://gerrit.wikimedia.org/r/736948 (https://phabricator.wikimedia.org/T293450)
[07:21:06] (PS4) Giuseppe Lavagetto: mediawiki::php: support multiple php version in monitoring too [puppet] - https://gerrit.wikimedia.org/r/736949
[07:21:42] (CR) jerkins-bot: [V: -1] fixup [puppet] - https://gerrit.wikimedia.org/r/736980 (owner: Giuseppe Lavagetto)
[07:23:04] (CR) jerkins-bot: [V: -1] profile::mediawiki::php: Allow running multiple php versions in parallel [puppet] - https://gerrit.wikimedia.org/r/736948 (https://phabricator.wikimedia.org/T293450) (owner: Giuseppe Lavagetto)
[07:23:40] (CR) jerkins-bot: [V: -1] mediawiki::php: support multiple php version in monitoring too [puppet] - https://gerrit.wikimedia.org/r/736949 (owner: Giuseppe Lavagetto)
[07:24:05] <_joe_> oh sigh did I botch the rebase
[07:24:15] (PS1) Elukey: Add fake stat100x's Swift credentials for the ML-team [labs/private] - https://gerrit.wikimedia.org/r/736982
[07:24:56] (CR) Elukey: [V: +2 C: +2] Add fake stat100x's Swift credentials for the ML-team [labs/private] - https://gerrit.wikimedia.org/r/736982 (owner: Elukey)
[07:26:01] (Abandoned) Giuseppe Lavagetto: fixup [puppet] - https://gerrit.wikimedia.org/r/736980 (owner: Giuseppe Lavagetto)
[07:28:04] (PS1) Elukey: Fix typo in path name for role::statistics::explorer::ml [labs/private] - https://gerrit.wikimedia.org/r/736983
[07:28:24] (CR) Elukey: [V: +2 C: +2] Fix typo in path name for role::statistics::explorer::ml [labs/private] - https://gerrit.wikimedia.org/r/736983 (owner: Elukey)
[07:33:17] (PS1) Elukey: role::statistics::explorer: move hiera values to the correct place [labs/private] - https://gerrit.wikimedia.org/r/736984
[07:33:38] (CR) Elukey: [V: +2 C: +2] role::statistics::explorer: move hiera values to the correct place [labs/private] - https://gerrit.wikimedia.org/r/736984 (owner: Elukey)
[07:34:20] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:35:57] (PS5) Elukey: role::statistics::explorer: add ml profile [puppet] - https://gerrit.wikimedia.org/r/736848 (https://phabricator.wikimedia.org/T280467)
[07:40:45] SRE, ops-eqiad: Q1 '19:(Need by: 2020-06-30) replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (ayounsi) >>! In T228919#7478724, @Cmjohnson wrote: > the new SCS is racked in A8 u47 Thank you. Could this be prioritized? The old device still alerts regularly.
[07:43:09] (PS6) Elukey: role::statistics::explorer: add ml profile [puppet] - https://gerrit.wikimedia.org/r/736848 (https://phabricator.wikimedia.org/T280467)
[07:44:21] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32157/console" [puppet] - https://gerrit.wikimedia.org/r/736848 (https://phabricator.wikimedia.org/T280467) (owner: Elukey)
[07:44:32] !log restart scs-a8-eqiad
[07:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:28] (CR) Elukey: [V: +1 C: +2] role::statistics::explorer: add ml profile [puppet] - https://gerrit.wikimedia.org/r/736848 (https://phabricator.wikimedia.org/T280467) (owner: Elukey)
[07:48:17] ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 43, down: 2, dormant: 0, excluded: 0, unused: 0: ayounsi Telia 01353995 - The acknowledgement expires at: 2021-11-06 07:47:28. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:50:02] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi Lumen 22314287 - The acknowledgement expires at: 2021-11-06 23:59:00. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:50:02] ACKNOWLEDGEMENT - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi Lumen 22314287 - The acknowledgement expires at: 2021-11-06 23:59:00. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:50:59] (PS1) Elukey: profile::statistics::explorer::ml: fix model_upload script perms [puppet] - https://gerrit.wikimedia.org/r/736985
[07:53:53] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[07:58:13] SRE, ops-codfw: mw2280 unresponsive to powercycle and hardreset - https://phabricator.wikimedia.org/T290708 (ayounsi) I set this device's status to "Failed" in Netbox as it was triggering alerts.
[07:58:37] (CR) Elukey: [C: +2] profile::statistics::explorer::ml: fix model_upload script perms [puppet] - https://gerrit.wikimedia.org/r/736985 (owner: Elukey)
[07:59:35] (CR) Elukey: role::statistics::explorer: add ml profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/736848 (https://phabricator.wikimedia.org/T280467) (owner: Elukey)
[08:10:05] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32158/console" [puppet] - https://gerrit.wikimedia.org/r/736805 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[08:11:30] (CR) Elukey: [C: +2] Add sslcert::trusted_root_ca [puppet] - https://gerrit.wikimedia.org/r/736765 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[08:11:39] (CR) Elukey: [C: +2] profile::kafka::broker: add truststore for pki-based tls certs [puppet] - https://gerrit.wikimedia.org/r/736785 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[08:11:48] (CR) Elukey: [V: +1 C: +2] profile::kafka::broker: allow to override super_users [puppet] - https://gerrit.wikimedia.org/r/736805 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[08:11:54] (PS3) Elukey: profile::kafka::broker: allow to override super_users [puppet] - https://gerrit.wikimedia.org/r/736805 (https://phabricator.wikimedia.org/T291905)
[08:14:10] (CR) Elukey: [C: +2] Enable PKI TLS certificates for kafka-test1006 [puppet] - https://gerrit.wikimedia.org/r/736786 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[08:22:56] (CR) Muehlenhoff: "check" [puppet] - https://gerrit.wikimedia.org/r/702669 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[08:24:40] (CR) JMeybohm: Implement CFSSL API signer (1 comment) [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/736808 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm)
[08:25:56] (PS2) Muehlenhoff: Switch puppetdb to profile::java [puppet] - https://gerrit.wikimedia.org/r/686583 (https://phabricator.wikimedia.org/T264178)
[08:26:49] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/686583 (https://phabricator.wikimedia.org/T264178) (owner: Muehlenhoff)
[08:26:59] (PS1) Ayounsi: Don't try to add breakout interfaces to interface-range [software/homer/deploy] - https://gerrit.wikimedia.org/r/736987 (https://phabricator.wikimedia.org/T289241)
[08:31:07] for x2 masters
[08:31:12] wrong window!
[08:34:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Upgrade x2 masters T295026
[08:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:23] T295026: Upgrade x2 to 10.4.21 - https://phabricator.wikimedia.org/T295026
[08:34:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Upgrade x2 masters T295026
[08:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on 6 hosts with reason: Upgrade x2 masters T295026
[08:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 6 hosts with reason: Upgrade x2 masters T295026
[08:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:10] !log installing tmux bugfix updates from Bullseye 11.1 point release
[08:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:02] !log installing reportbug bugfix updates from Bullseye 11.1 point release
[08:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:52] SRE, Infrastructure-Foundations: Integrate Bullseye 11.1 point update - https://phabricator.wikimedia.org/T292844 (MoritzMuehlenhoff)
[08:44:10] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 438, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:44:57] (CR) Giuseppe Lavagetto: "recheck" [puppet] - https://gerrit.wikimedia.org/r/736276 (https://phabricator.wikimedia.org/T293450) (owner: Giuseppe Lavagetto)
[08:46:17] !log Upgrade db2142 T295026
[08:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:20] T295026: Upgrade x2 to 10.4.21 - https://phabricator.wikimedia.org/T295026
[08:51:09] (PS1) Elukey: Revert kafka PKI settings for kafka-test1006 [puppet] - https://gerrit.wikimedia.org/r/736990
[08:52:49] !log installing set kvm::machine_version for ganeti-test cluster to pc-i440fx-2.8 T286206
[08:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:53] T286206: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206
[08:54:15] (CR) Ema: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32159/console" [puppet] - https://gerrit.wikimedia.org/r/736787 (https://phabricator.wikimedia.org/T241239) (owner: Ema)
[08:59:41] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti6001.drmrs.wmnet with OS buster
[08:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:51] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host ganeti6001.drmrs.wmnet with OS buster
[09:00:11] (CR) Elukey: [C: +2] Revert kafka PKI settings for kafka-test1006 [puppet] - https://gerrit.wikimedia.org/r/736990 (owner: Elukey)
[09:01:23] !log apt.wm.org: remove varnish 6.0.8-1wm1 from component main of buster-wikimedia, we use component/varnish6 instead
[09:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:32] (PS1) Elukey: profile::kafka::broker: add conditions to truststore deployments [puppet] - https://gerrit.wikimedia.org/r/736991 (https://phabricator.wikimedia.org/T291905)
[09:03:49] (PS6) Giuseppe Lavagetto: php: allow installing multiple php versions at the same time [puppet] - https://gerrit.wikimedia.org/r/736276 (https://phabricator.wikimedia.org/T293450)
[09:03:51] (PS5) Giuseppe Lavagetto: profile::mediawiki::php: Allow running multiple php versions in parallel [puppet] - https://gerrit.wikimedia.org/r/736948 (https://phabricator.wikimedia.org/T293450)
[09:07:32] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32161/console" [puppet] - https://gerrit.wikimedia.org/r/736948 (https://phabricator.wikimedia.org/T293450) (owner: Giuseppe Lavagetto)
[09:08:01] (PS4) JMeybohm: Add simple-cfssl image for development and e2e tests [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/736809 (https://phabricator.wikimedia.org/T294560)
[09:08:12] (PS6) Vgutierrez: prometheus::ops: Add haproxy-tls@cache_upload config [puppet] - https://gerrit.wikimedia.org/r/736278 (https://phabricator.wikimedia.org/T290005)
[09:08:14] (PS5) Vgutierrez: site: Use role cache::upload_haproxy for cp4026 [puppet] - https://gerrit.wikimedia.org/r/736477 (https://phabricator.wikimedia.org/T290005)
[09:09:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2001.codfw.wmnet
[09:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:45] (PS2) Elukey: profile::kafka::broker: add conditions to truststore deployments [puppet] - https://gerrit.wikimedia.org/r/736991 (https://phabricator.wikimedia.org/T291905)
[09:16:00] (PS5) JMeybohm: Add simple-cfssl image for development and e2e tests [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/736809 (https://phabricator.wikimedia.org/T294560)
[09:17:03] (PS1) Vgutierrez: cache:haproxy: Require haproxy bpo pin to install it [puppet] - https://gerrit.wikimedia.org/r/736993 (https://phabricator.wikimedia.org/T290005)
[09:17:08] (CR) JMeybohm: Add simple-cfssl image for development and e2e tests (1 comment) [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/736809 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm)
[09:19:52] !log Upgrade db1151 T295026
[09:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:55] T295026: Upgrade x2 to 10.4.21 - https://phabricator.wikimedia.org/T295026
[09:25:54] RECOVERY - Host mr1-drmrs.oob is UP: PING OK - Packet loss = 0%, RTA = 88.10 ms
[09:26:24] RECOVERY - Router interfaces on mr1-drmrs is OK: OK: host 185.15.58.130, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:27:03] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host testvm2001.codfw.wmnet
[09:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2002.codfw.wmnet
[09:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:14] RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 460.75 ms
[09:29:13] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32163/console" [puppet] - https://gerrit.wikimedia.org/r/736991 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[09:30:00] (CR) Elukey: profile::kafka::broker: add conditions to truststore deployments [puppet] - https://gerrit.wikimedia.org/r/736991 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[09:30:24] (CR) Elukey: [C: +2] profile::kafka::broker: add conditions to truststore deployments [puppet] - https://gerrit.wikimedia.org/r/736991 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[09:36:02] (CR) Jbond: Implement CFSSL API signer (1 comment) [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/736808 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm)
[09:36:27] (PS1) Muehlenhoff: ganeti: Fix up row configuration for ganeti test cluster [software/spicerack] - https://gerrit.wikimedia.org/r/736994 (https://phabricator.wikimedia.org/T286206)
[09:39:55] !log mmandere@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti6001.drmrs.wmnet with OS buster
[09:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:07] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host ganeti6001.drmrs.wmnet with OS buster executed with errors: - gan...
[09:43:27] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host testvm2002.codfw.wmnet
[09:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:02] (CR) David Caro: [C: +1] "Neat!" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/736581 (owner: Andrew Bogott)
[09:47:44] SRE, Traffic: Varnish packages installed from the wrong component on host reimage - https://phabricator.wikimedia.org/T295120 (ema)
[09:49:23] SRE, Traffic: Varnish packages installed from the wrong component on host reimage - https://phabricator.wikimedia.org/T295120 (ema)
[09:50:56] SRE, Traffic: Varnish packages installed from the wrong component on host reimage - https://phabricator.wikimedia.org/T295120 (MoritzMuehlenhoff) The puppet code lacks a "priority => 1002", if you want to override "main" (which also has priority=1001). See the comments in the apt::package_from_component...
[09:53:52] !log cp[4033-4036]: upgrade varnish to 6.0.8-1wm2 T295120
[09:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:55] T295120: Varnish packages installed from the wrong component on host reimage - https://phabricator.wikimedia.org/T295120
[09:54:20] SRE, Traffic: Varnish packages installed from the wrong component on host reimage - https://phabricator.wikimedia.org/T295120 (ema) p:Triage→Medium
[09:56:09] SRE, Traffic: Varnish packages installed from the wrong component on host reimage - https://phabricator.wikimedia.org/T295120 (ema) >>! In T295120#7484351, @MoritzMuehlenhoff wrote: > The puppet code lacks a "priority => 1002", if you want to override "main" (which also has priority=1001). See the commen...
[09:56:10] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/736787 (https://phabricator.wikimedia.org/T241239) (owner: Ema)
[10:04:50] (PS1) Ema: varnish: give higher priority to component/varnish6 [puppet] - https://gerrit.wikimedia.org/r/736995 (https://phabricator.wikimedia.org/T295120)
[10:09:49] (PS4) JMeybohm: Implement CFSSL API signer [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/736808 (https://phabricator.wikimedia.org/T294560)
[10:09:51] (PS6) JMeybohm: Add simple-cfssl image for development and e2e tests [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/736809 (https://phabricator.wikimedia.org/T294560)
[10:14:16] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/736995 (https://phabricator.wikimedia.org/T295120) (owner: Ema)
[10:16:50] SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (aborrero)
[10:20:06] (CR) Ema: [C: +2] varnish: give higher priority to component/varnish6 [puppet] - https://gerrit.wikimedia.org/r/736995 (https://phabricator.wikimedia.org/T295120) (owner: Ema)
[10:21:09] SRE, Traffic, Patch-For-Review: Varnish packages installed from the wrong component on host reimage - https://phabricator.wikimedia.org/T295120 (ema) Open→Resolved
[10:29:06] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti6001.drmrs.wmnet with OS buster
[10:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:16] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host ganeti6001.drmrs.wmnet with OS buster
[10:34:35] SRE-swift-storage, User-Inductiveload: Unable to upload to Commons: uploadstash-file-not-found: Key "187kyl5ozj74.xtav8j.51508.djvu" not found in stash - https://phabricator.wikimedia.org/T278104 (Yann) HI, I got a similar error while trying to import a 492 MB TIFF via chunked-upload from https://archive...
[10:39:21] (PS1) Arturo Borrero Gonzalez: cloudgw: keepalived: introduce service startup delay [puppet] - https://gerrit.wikimedia.org/r/736999 (https://phabricator.wikimedia.org/T294956)
[10:40:55] (CR) Muehlenhoff: [C: +2] ganeti: Fix up row configuration for ganeti test cluster [software/spicerack] - https://gerrit.wikimedia.org/r/736994 (https://phabricator.wikimedia.org/T286206) (owner: Muehlenhoff)
[10:41:44] (CR) Muehlenhoff: "Note: I'm livehacking that setting on cumin2002 to unblock further setup steps for the new test cluster (until a new Spicerack release get" [software/spicerack] - https://gerrit.wikimedia.org/r/736994 (https://phabricator.wikimedia.org/T286206) (owner: Muehlenhoff)
[10:42:41] (PS8) Jbond: controller: update so we can run pcc on both cloud and production hosts [software/puppet-compiler] - https://gerrit.wikimedia.org/r/736803
[10:52:17] (PS7) Giuseppe Lavagetto: php: allow installing multiple php versions at the same time [puppet] - https://gerrit.wikimedia.org/r/736276 (https://phabricator.wikimedia.org/T293450)
[10:52:19] (PS6) Giuseppe Lavagetto: profile::mediawiki::php: Allow running multiple php versions in parallel [puppet] - https://gerrit.wikimedia.org/r/736948 (https://phabricator.wikimedia.org/T293450)
[10:53:26] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32164/console" [puppet] - https://gerrit.wikimedia.org/r/736948 (https://phabricator.wikimedia.org/T293450) (owner: Giuseppe Lavagetto)
[10:57:50] (CR) David Caro: [C: +1] "only some typos and such, nits can be ignored" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/736803 (owner: Jbond)
[11:00:34] (Abandoned) Arturo Borrero Gonzalez: cloudgw: keepalived: introduce service startup delay [puppet] - https://gerrit.wikimedia.org/r/736999 (https://phabricator.wikimedia.org/T294956) (owner: Arturo Borrero Gonzalez)
[11:01:12] SRE, Commons, Tools, Wikimedia-Mailing-lists: Character encoding issues on daily-image-l - https://phabricator.wikimedia.org/T295096 (Ladsgroup) It's not that related but the footer in Persian in wikifa-admin-l is all ???? instead (both email and hyperkitty) while the footer file is correct. I th...
[11:04:25] (CR) Hnowlan: [C: +2] maps: Add script to send tile invalidation events [puppet] - https://gerrit.wikimedia.org/r/722825 (https://phabricator.wikimedia.org/T270175) (owner: Jgiannelos)
[11:05:54] (CR) Hnowlan: [V: +1 C: +2] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32165/console" [puppet] - https://gerrit.wikimedia.org/r/722825 (https://phabricator.wikimedia.org/T270175) (owner: Jgiannelos)
[11:09:20] !log mmandere@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti6001.drmrs.wmnet with OS buster
[11:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:29] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host ganeti6001.drmrs.wmnet with OS buster executed with errors: - gan...
[11:11:49] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10aborrero) [11:11:53] (03PS1) 10Vgutierrez: upload_haproxy: Adopt ::profile::base::production [puppet] - 10https://gerrit.wikimedia.org/r/737005 (https://phabricator.wikimedia.org/T290005) [11:13:38] (03PS2) 10Vgutierrez: cache:haproxy: Require haproxy bpo pin to install it [puppet] - 10https://gerrit.wikimedia.org/r/736993 (https://phabricator.wikimedia.org/T290005) [11:13:41] (03PS2) 10Vgutierrez: upload_haproxy: Adopt ::profile::base::production [puppet] - 10https://gerrit.wikimedia.org/r/737005 (https://phabricator.wikimedia.org/T290005) [11:13:43] (03PS6) 10Vgutierrez: site: Use role cache::upload_haproxy for cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/736477 (https://phabricator.wikimedia.org/T290005) [11:15:18] (03CR) 10Vgutierrez: [C: 03+2] prometheus::ops: Add haproxy-tls@cache_upload config [puppet] - 10https://gerrit.wikimedia.org/r/736278 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [11:16:21] (03CR) 10Vgutierrez: [C: 03+2] cache:haproxy: Require haproxy bpo pin to install it [puppet] - 10https://gerrit.wikimedia.org/r/736993 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [11:17:25] (03CR) 10Vgutierrez: [C: 03+2] upload_haproxy: Adopt ::profile::base::production [puppet] - 10https://gerrit.wikimedia.org/r/737005 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [11:24:08] (03PS8) 10Giuseppe Lavagetto: php: allow installing multiple php versions at the same time [puppet] - 10https://gerrit.wikimedia.org/r/736276 (https://phabricator.wikimedia.org/T293450) [11:24:10] (03PS7) 10Giuseppe Lavagetto: profile::mediawiki::php: Allow running multiple php versions in parallel [puppet] - 10https://gerrit.wikimedia.org/r/736948 (https://phabricator.wikimedia.org/T293450) [11:24:12] (03PS5) 10Giuseppe Lavagetto: mediawiki::php: support multiple php 
version in monitoring too [puppet] - 10https://gerrit.wikimedia.org/r/736949 [11:26:07] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32166/console" [puppet] - 10https://gerrit.wikimedia.org/r/736949 (owner: 10Giuseppe Lavagetto) [11:26:45] (03CR) 10David Caro: "Some minor refactor/etc stuff, also can you `black -l 120` the whole file?" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736915 (owner: 10Andrew Bogott) [11:33:42] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10aborrero) [11:35:07] PROBLEM - MariaDB read only x2 #page on db2142 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.4.21-MariaDB-log, Uptime 10071s, event_scheduler: True, 29.22 QPS, connection latency: 0.004301s, query latency: 0.000510s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:35:55] Amir1: ^ [11:36:04] what [11:36:09] you kidding [11:36:20] you need to do set global read_only = false [11:36:23] on both masters [11:36:27] PROBLEM - MariaDB read only x2 #page on db1151 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.4.21-MariaDB-log, Uptime 8110s, event_scheduler: True, 16.58 QPS, connection latency: 0.003875s, query latency: 0.000517s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:36:29] as X2 is multi master [11:36:40] I acked it [11:36:41] see? 
on both :-) [11:36:56] sobanski: thanks [11:37:08] And the other one too :) [11:37:18] for everyone: X2 isn't in use [11:38:10] Amir1: so set global read_only = false; on db1151 and db2142 [11:38:14] that will fix it [11:38:16] marostegui: done both [11:38:26] sweeet [11:38:33] RECOVERY - MariaDB read only x2 #page on db1151 is OK: Version 10.4.21-MariaDB-log, Uptime 8235s, read_only: False, event_scheduler: True, 29.41 QPS, connection latency: 0.004273s, query latency: 0.000517s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:38:39] sorry, did I miss it somewhere? [11:38:54] VO is amazing, now I got the ping [11:39:21] RECOVERY - MariaDB read only x2 #page on db2142 is OK: Version 10.4.21-MariaDB-log, Uptime 10323s, read_only: False, event_scheduler: True, 16.57 QPS, connection latency: 0.004278s, query latency: 0.000501s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:39:29] * Emperor admires timing of our washing machine that took their attention away between the page email arriving and you good people fixing it :) [11:39:40] there you go the recovers [11:40:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2001.codfw.wmnet [11:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:06] marostegui: added to docs https://wikitech.wikimedia.org/w/index.php?title=MariaDB/Upgrading_a_section&diff=1931682&oldid=1931596 [11:41:32] Amir1: the problem is that this only applies to x2 [11:41:37] so it is a snowflake [11:41:53] Can we resolve the incidents in VO now? Just to make sure it doesn’t fire again tomorrow. 
[11:42:09] sobanski: yes, thanks [11:42:11] I’ll do it, just confirming [11:42:26] Ah, already done [11:42:30] (03PS3) 10JMeybohm: Rename everything to cfssl-issuer, ensure e2e completed [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/736807 (https://phabricator.wikimedia.org/T294560) [11:42:32] (03PS5) 10JMeybohm: Implement CFSSL API signer [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/736808 (https://phabricator.wikimedia.org/T294560) [11:42:33] sobanski: sorry for having you here on holidays! [11:42:34] (03PS7) 10JMeybohm: Add simple-cfssl image for development and e2e tests [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/736809 (https://phabricator.wikimedia.org/T294560) [11:42:46] sobanski: done [11:43:01] * sobanski goes back to beer browsing [11:43:21] (03CR) 10Muehlenhoff: cumin: reorganize mediawiki aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) (owner: 10Dzahn) [11:43:52] marostegui: are we going to have more multimaster in the future? 
(or turn other systems to multimaster) [11:44:57] nop for now [11:47:22] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/736809 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [11:53:31] (03CR) 10Jbond: "update" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736803 (owner: 10Jbond) [11:54:14] (03PS9) 10Jbond: controller: update so we can run pcc on both cloud and production hosts [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736803 [12:01:58] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host testvm2001.codfw.wmnet [12:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:04] (03PS1) 10Ssingh: bird::anycast_healthchecker: allow customization of logging options [puppet] - 10https://gerrit.wikimedia.org/r/737015 [12:08:49] (03PS2) 10Ssingh: bird::anycast_healthchecker: allow customization of logging options [puppet] - 10https://gerrit.wikimedia.org/r/737015 [12:10:12] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti6001.drmrs.wmnet with OS buster [12:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:21] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host ganeti6001.drmrs.wmnet with OS buster [12:12:23] (03CR) 10Ssingh: "No change for existing host:" [puppet] - 10https://gerrit.wikimedia.org/r/737015 (owner: 10Ssingh) [12:13:12] (03PS3) 10Ssingh: bird::anycast_healthchecker: allow customization of logging options [puppet] - 10https://gerrit.wikimedia.org/r/737015 [12:14:47] (03CR) 10Ssingh: "Change when the relevant Hiera is set:" [puppet] - 10https://gerrit.wikimedia.org/r/737015 (owner: 10Ssingh) [12:15:17] (03PS4) 10Ssingh: 
bird::anycast_healthchecker: allow customization of logging options [puppet] - 10https://gerrit.wikimedia.org/r/737015 [12:17:30] (03PS1) 10Ayounsi: Ferm: allow dhcp request from infra IPs [puppet] - 10https://gerrit.wikimedia.org/r/737020 (https://phabricator.wikimedia.org/T282787) [12:17:34] (03CR) 10Ssingh: "This is ready for review. I am happy to add more options (maybe log_maxbytes?) if desired. I am also not sure if there was a reason why we" [puppet] - 10https://gerrit.wikimedia.org/r/737015 (owner: 10Ssingh) [12:17:43] (03CR) 10Hnowlan: [C: 03+2] api-gateway: allow /staging/ testing namespace only in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/715467 (https://phabricator.wikimedia.org/T289583) (owner: 10Hnowlan) [12:21:10] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/32170/install1003.wikimedia.org/index.html seems happy" [puppet] - 10https://gerrit.wikimedia.org/r/737020 (https://phabricator.wikimedia.org/T282787) (owner: 10Ayounsi) [12:22:40] (03Merged) 10jenkins-bot: api-gateway: allow /staging/ testing namespace only in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/715467 (https://phabricator.wikimedia.org/T289583) (owner: 10Hnowlan) [12:22:58] !log renamed Ganeti group of test cluster from "default" to "row_A" (following conventions in main DCs) T286206 [12:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:01] T286206: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 [12:23:18] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/737020 (https://phabricator.wikimedia.org/T282787) (owner: 10Ayounsi) [12:23:35] (03CR) 10Ayounsi: [C: 03+2] Ferm: allow dhcp request from infra IPs [puppet] - 10https://gerrit.wikimedia.org/r/737020 (https://phabricator.wikimedia.org/T282787) (owner: 10Ayounsi) [12:24:33] (03CR) 10Muehlenhoff: Ferm: allow dhcp request from infra IPs (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/737020 (https://phabricator.wikimedia.org/T282787) (owner: 10Ayounsi) [12:26:38] (03PS1) 10Ayounsi: FIX DHCP ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/737023 [12:27:37] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/737023 (owner: 10Ayounsi) [12:28:52] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10User-jbond: Document IDP MFA policy and processes - https://phabricator.wikimedia.org/T284725 (10MoritzMuehlenhoff) 05Open→03In progress [12:29:32] (03CR) 10Ayounsi: [C: 03+2] FIX DHCP ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/737023 (owner: 10Ayounsi) [12:31:40] PROBLEM - Check systemd state on install1003 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:46] RECOVERY - Check systemd state on install1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:46:14] (03PS1) 10Hnowlan: api-gateway: Bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/737028 (https://phabricator.wikimedia.org/T289583) [12:50:26] !log mmandere@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti6001.drmrs.wmnet with OS buster [12:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:36] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host ganeti6001.drmrs.wmnet with OS buster executed with errors: - gan... 
[12:51:03] (03PS6) 10JMeybohm: Implement CFSSL API signer [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/736808 (https://phabricator.wikimedia.org/T294560) [12:51:05] (03PS8) 10JMeybohm: Add simple-cfssl image for development and e2e tests [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/736809 (https://phabricator.wikimedia.org/T294560) [13:09:17] (03PS1) 10Jelto: services: cleanup all helmfiles after helm3 migration [deployment-charts] - 10https://gerrit.wikimedia.org/r/737034 (https://phabricator.wikimedia.org/T251305) [13:19:32] (03CR) 10Jelto: [C: 04-1] "do not merge, services need to be redeployed first" [deployment-charts] - 10https://gerrit.wikimedia.org/r/737034 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [13:20:06] (03PS2) 10Ema: prometheus: remove varnish_2layer [puppet] - 10https://gerrit.wikimedia.org/r/736787 (https://phabricator.wikimedia.org/T241239) [13:21:36] (03PS5) 10Arturo Borrero Gonzalez: cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [13:21:58] (03CR) 10Ema: [C: 03+2] prometheus: remove varnish_2layer [puppet] - 10https://gerrit.wikimedia.org/r/736787 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [13:22:15] (03CR) 10jerkins-bot: [V: 04-1] cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) (owner: 10Arturo Borrero Gonzalez) [13:24:47] vgutierrez: there seems to be a puppet error on the prometheus hosts, although I don't think icinga complained? 
[13:24:55] Nov 5 13:14:23 prometheus3001 puppet-agent[23018]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Prometheus::Cluster_config[haproxy_tls_upload_esams]: [13:25:02] Nov 5 13:14:23 prometheus3001 puppet-agent[23018]: has no parameter named 'class_name' [13:28:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2001.codfw.wmnet [13:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:38] (03PS1) 10Ema: Revert "prometheus::ops: Add haproxy-tls@cache_upload config" [puppet] - 10https://gerrit.wikimedia.org/r/736834 [13:28:54] (03PS2) 10Ema: Revert "prometheus::ops: Add haproxy-tls@cache_upload config" [puppet] - 10https://gerrit.wikimedia.org/r/736834 [13:31:03] (03CR) 10Ema: [C: 03+2] Revert "prometheus::ops: Add haproxy-tls@cache_upload config" [puppet] - 10https://gerrit.wikimedia.org/r/736834 (owner: 10Ema) [13:32:31] (03CR) 10Jbond: [C: 03+2] controller: update so we can run pcc on both cloud and production hosts (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736803 (owner: 10Jbond) [13:32:41] (03CR) 10Jbond: [C: 03+2] populate_puppetdb: add support for cloud nodes [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736758 (owner: 10Jbond) [13:33:40] (03Merged) 10jenkins-bot: populate_puppetdb: add support for cloud nodes [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736758 (owner: 10Jbond) [13:33:42] (03Merged) 10jenkins-bot: controller: update so we can run pcc on both cloud and production hosts [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/736803 (owner: 10Jbond) [13:36:02] (03PS1) 10Jbond: cfssl::config: support per profile auth keys [puppet] - 10https://gerrit.wikimedia.org/r/737036 [13:42:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2001.codfw.wmnet [13:42:30] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:54] (03PS3) 10Herron: role::elasticsearch::cirrus: ship ES logs via gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/736859 (https://phabricator.wikimedia.org/T288620) [14:00:19] (03CR) 10ArielGlenn: [C: 03+1] "Looks good, very helpful." [puppet] - 10https://gerrit.wikimedia.org/r/736815 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [14:02:58] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/736859 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [14:04:12] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:09:34] (03PS1) 10Elukey: sslcert::trusted_ca: add jks truststore [puppet] - 10https://gerrit.wikimedia.org/r/737044 [14:09:51] (03PS1) 10JMeybohm: Add new golang 1.17 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/737045 [14:10:20] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.14 ms [14:10:58] (03CR) 10Majavah: Add new golang 1.17 image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/737045 (owner: 10JMeybohm) [14:11:34] (03PS2) 10JMeybohm: Add new golang 1.17 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/737045 [14:13:00] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32171/console" [puppet] - 10https://gerrit.wikimedia.org/r/737044 (owner: 10Elukey) [14:14:38] (03CR) 10Elukey: sslcert::trusted_ca: add jks truststore [puppet] - 10https://gerrit.wikimedia.org/r/737044 (owner: 10Elukey) [14:15:09] (03PS2) 10Elukey: sslcert::trusted_ca: add jks truststore [puppet] - 10https://gerrit.wikimedia.org/r/737044 [14:16:01] (03PS6) 10Arturo Borrero Gonzalez: cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [14:16:33] (03CR) 
10jerkins-bot: [V: 04-1] cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) (owner: 10Arturo Borrero Gonzalez) [14:17:44] (03CR) 10JMeybohm: Add new golang 1.17 image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/737045 (owner: 10JMeybohm) [14:17:59] (03Abandoned) 10Ppchelko: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/736913 (owner: 10PipelineBot) [14:18:23] (03Abandoned) 10Ppchelko: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/736909 (owner: 10PipelineBot) [14:18:37] (03Abandoned) 10Ppchelko: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/736907 (owner: 10PipelineBot) [14:19:55] (03PS7) 10Arturo Borrero Gonzalez: cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [14:20:30] (03CR) 10jerkins-bot: [V: 04-1] cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) (owner: 10Arturo Borrero Gonzalez) [14:24:04] (03PS8) 10Arturo Borrero Gonzalez: cloud: introduce network tests [puppet] - 10https://gerrit.wikimedia.org/r/736819 (https://phabricator.wikimedia.org/T294955) [14:24:33] (03CR) 10Elukey: [C: 03+1] "LGTM!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/737045 (owner: 10JMeybohm) [14:25:31] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add new golang 1.17 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/737045 (owner: 10JMeybohm) [14:26:03] (03CR) 10Andrew Bogott: Added cookbook to create an nfs server (036 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736915 (owner: 10Andrew Bogott) [14:26:05] (03CR) 10Muehlenhoff: Retire role::mediawiki::common (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730836 (owner: 10Muehlenhoff) [14:30:29] !log published docker-registry.discovery.wmnet/golang1.17:1.17-1 [14:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:15] (03PS2) 10Jbond: cfssl::config: support per profile auth keys [puppet] - 10https://gerrit.wikimedia.org/r/737036 [14:34:34] 10SRE, 10SRE-OnFire, 10Wikimedia-Incident: 2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users - https://phabricator.wikimedia.org/T292792 (10herron) [14:39:53] 10SRE-OnFire: 2021-10-22 eqiad return path timeouts - https://phabricator.wikimedia.org/T295152 (10herron) p:05Triage→03Medium [14:40:19] 10SRE-OnFire: 2021-10-22 eqiad return path timeouts - https://phabricator.wikimedia.org/T295152 (10herron) [14:40:22] 10SRE, 10ops-eqiad, 10Sustainability (Incident Followup): eqiad: patch 2nd Equinix IXP - https://phabricator.wikimedia.org/T293726 (10herron) [14:41:39] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10User-dcaro, 10cloud-services-team (Kanban): Investigate use of Puppet "environments" for per-project Puppet manifests - https://phabricator.wikimedia.org/T170370 (10dcaro) [14:42:34] (03CR) 10David Caro: [C: 03+1] toolforge::cronrunner: disable cron on non-active hosts [puppet] - 10https://gerrit.wikimedia.org/r/732986 (https://phabricator.wikimedia.org/T284767) (owner: 10Majavah) [14:42:44] 10SRE-OnFire: 2021-10-25 s3 db recentchanges 
replica - https://phabricator.wikimedia.org/T295154 (10herron) p:05Triage→03Medium [14:43:14] 10SRE-OnFire: 2021-10-25 s3 db recentchanges replica - https://phabricator.wikimedia.org/T295154 (10herron) [14:44:46] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 60%, RTA = 2342.53 ms [14:48:14] 10SRE-OnFire: 2021-10-29 graphite - https://phabricator.wikimedia.org/T295157 (10herron) p:05Triage→03Medium [14:50:52] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 239.48 ms [14:52:50] (03CR) 10David Caro: Added cookbook to create an nfs server (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736915 (owner: 10Andrew Bogott) [14:54:46] 10SRE-OnFire, 10SRE Observability (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10herron) As an initial approach to help visualize incidents needing to be reviewed and scored in Phabricator I've added a column to the #SRE-Onfire... [14:59:56] 10SRE, 10SRE Observability, 10Wikimedia-Logstash, 10observability, and 2 others: Implement sensitive logstash access control - https://phabricator.wikimedia.org/T213902 (10Aklapper) [15:02:22] (03CR) 10Herron: "the gelf_relay is in place on elastic1049, this would deploy it to the remainder of role::elasticsearch::cirrus. 
after that will need som" [puppet] - 10https://gerrit.wikimedia.org/r/736859 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [15:06:23] (03CR) 10Hnowlan: [C: 03+2] api-gateway: Bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/737028 (https://phabricator.wikimedia.org/T289583) (owner: 10Hnowlan) [15:11:27] (03Merged) 10jenkins-bot: api-gateway: Bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/737028 (https://phabricator.wikimedia.org/T289583) (owner: 10Hnowlan) [15:21:07] (03PS1) 10Tks4Fish: kswiki: Adding wordmark and tagline files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737054 (https://phabricator.wikimedia.org/T294093) [15:22:14] (03PS4) 10Ideophagous: Bug:T291737 Change-Id: Ib263a5419c6ace911a597d025b28d6ef13549c10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735712 [15:22:26] (03PS4) 10Ideophagous: reapplied changes to arywiki ns after hard reset, Bug:T291737 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735713 [15:24:46] (03PS3) 10Elukey: sslcert::trusted_ca: add jks truststore [puppet] - 10https://gerrit.wikimedia.org/r/737044 [15:24:48] (03PS1) 10Elukey: profile::kafka::broker: use a jks truststore for PKI TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/737055 (https://phabricator.wikimedia.org/T291905) [15:26:09] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32174/console" [puppet] - 10https://gerrit.wikimedia.org/r/737055 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [15:28:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/737044 (owner: 10Elukey) [15:30:35] (03CR) 10Jbond: [C: 03+1] "lgtn" [puppet] - 10https://gerrit.wikimedia.org/r/737055 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [15:35:58] (03CR) 10Elukey: [C: 03+2] sslcert::trusted_ca: add jks truststore [puppet] - 10https://gerrit.wikimedia.org/r/737044 (owner: 
10Elukey) [15:36:11] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::kafka::broker: use a jks truststore for PKI TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/737055 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [15:38:16] !log hnowlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [15:38:17] !log hnowlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [15:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:32] (03PS1) 10Elukey: java::cacert: add the -keystore option when creating a custom truststore [puppet] - 10https://gerrit.wikimedia.org/r/737058 [15:43:50] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32175/console" [puppet] - 10https://gerrit.wikimedia.org/r/737058 (owner: 10Elukey) [15:45:08] (03CR) 10Elukey: [V: 03+1 C: 03+2] java::cacert: add the -keystore option when creating a custom truststore [puppet] - 10https://gerrit.wikimedia.org/r/737058 (owner: 10Elukey) [15:53:33] (03CR) 10Dzahn: [C: 03+2] ":) thx" [puppet] - 10https://gerrit.wikimedia.org/r/736815 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [15:53:40] (03PS1) 10Tks4Fish: kswiki: Adding wordmark and tagline to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737061 (https://phabricator.wikimedia.org/T294093) [15:54:17] (03PS1) 10David Caro: ceph::auth::load_all: allow generating the keyring path [puppet] - 10https://gerrit.wikimedia.org/r/737062 (https://phabricator.wikimedia.org/T293752) [15:54:19] (03PS1) 10David Caro: ceph::auth: skip keys with no keydata [puppet] - 10https://gerrit.wikimedia.org/r/737063 (https://phabricator.wikimedia.org/T293752) [15:54:38] (03CR) 10jerkins-bot: [V: 04-1] kswiki: Adding wordmark and tagline to 
IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737061 (https://phabricator.wikimedia.org/T294093) (owner: 10Tks4Fish) [15:59:22] (03PS2) 10David Caro: ceph::auth: skip keys with no keydata [puppet] - 10https://gerrit.wikimedia.org/r/737063 (https://phabricator.wikimedia.org/T293752) [16:00:06] (03PS1) 10AOkoth: gitlab: accept backup file argument [puppet] - 10https://gerrit.wikimedia.org/r/737064 (https://phabricator.wikimedia.org/T274463) [16:01:09] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001 [16:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:00] (03PS2) 10David Caro: ceph::auth::load_all: allow generating the keyring path [puppet] - 10https://gerrit.wikimedia.org/r/737062 (https://phabricator.wikimedia.org/T293752) [16:05:02] (03PS3) 10David Caro: ceph::auth: skip keys with no keydata [puppet] - 10https://gerrit.wikimedia.org/r/737063 (https://phabricator.wikimedia.org/T293752) [16:05:58] (03PS1) 10Elukey: profile::presto::server: use a stricter truststore [puppet] - 10https://gerrit.wikimedia.org/r/737065 [16:07:05] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32179/console" [puppet] - 10https://gerrit.wikimedia.org/r/737065 (owner: 10Elukey) [16:07:14] (03CR) 10jerkins-bot: [V: 04-1] profile::presto::server: use a stricter truststore [puppet] - 10https://gerrit.wikimedia.org/r/737065 (owner: 10Elukey) [16:07:25] uff [16:08:02] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32178/console" [puppet] - 10https://gerrit.wikimedia.org/r/737063 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [16:10:05] (03PS2) 10Elukey: profile::presto::server: use a stricter truststore [puppet] - 
10https://gerrit.wikimedia.org/r/737065 [16:11:18] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32180/console" [puppet] - 10https://gerrit.wikimedia.org/r/737065 (owner: 10Elukey) [16:11:37] (03PS2) 10AOkoth: gitlab: accept backup file argument [puppet] - 10https://gerrit.wikimedia.org/r/737064 (https://phabricator.wikimedia.org/T274463) [16:18:11] (03PS3) 10AOkoth: gitlab: accept backup file argument [puppet] - 10https://gerrit.wikimedia.org/r/737064 (https://phabricator.wikimedia.org/T274463) [16:21:29] (03PS3) 10Elukey: profile::presto::server: use a stricter truststore [puppet] - 10https://gerrit.wikimedia.org/r/737065 [16:21:49] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001 [16:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:15] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 4 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32181/console" [puppet] - 10https://gerrit.wikimedia.org/r/737065 (owner: 10Elukey) [16:25:44] (03CR) 10Cwhite: [C: 03+1] role::elasticsearch::cirrus: ship ES logs via gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/736859 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [16:30:18] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti6001.drmrs.wmnet with OS buster [16:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:27] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host ganeti6001.drmrs.wmnet with OS buster [16:39:50] (03PS1) 10Elukey: profile::kafka::mirror: add 
settings to support the migration to PKI [puppet] - 10https://gerrit.wikimedia.org/r/737091 (https://phabricator.wikimedia.org/T291905) [16:40:56] (03CR) 10Arturo Borrero Gonzalez: "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/737063 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [16:41:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] ceph::auth::load_all: allow generating the keyring path [puppet] - 10https://gerrit.wikimedia.org/r/737062 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [16:45:38] (03CR) 10Arturo Borrero Gonzalez: "This change is ready for review." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736201 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [16:45:45] (03PS5) 10Andrew Bogott: Added cookbook to create an nfs server [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736915 [16:47:01] (03CR) 10Arturo Borrero Gonzalez: "Does it makes sense, in the same patch, to load the ::deploy profile on the codfw1dev hypervisors?" 
[puppet] - 10https://gerrit.wikimedia.org/r/736483 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [16:49:40] (03PS1) 10Vgutierrez: Revert "Revert "prometheus::ops: Add haproxy-tls@cache_upload config"" [puppet] - 10https://gerrit.wikimedia.org/r/737092 [16:51:04] (03PS2) 10Elukey: profile::kafka::mirror: add settings to support the migration to PKI [puppet] - 10https://gerrit.wikimedia.org/r/737091 (https://phabricator.wikimedia.org/T291905) [16:52:17] !log mmandere@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti6001.drmrs.wmnet with OS buster [16:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:25] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host ganeti6001.drmrs.wmnet with OS buster executed with errors: - gan... [16:54:11] (03PS3) 10Elukey: profile::kafka::mirror: add settings to support the migration to PKI [puppet] - 10https://gerrit.wikimedia.org/r/737091 (https://phabricator.wikimedia.org/T291905) [16:57:11] (03PS4) 10Elukey: profile::kafka::mirror: add settings to support the migration to PKI [puppet] - 10https://gerrit.wikimedia.org/r/737091 (https://phabricator.wikimedia.org/T291905) [16:58:20] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [16:58:23] (03PS1) 10Majavah: hieradata: replace deployment-prep mwmaint to buster [puppet] - 10https://gerrit.wikimedia.org/r/737093 (https://phabricator.wikimedia.org/T278664) [17:00:51] 10ops-drmrs, 10DC-Ops: (Need By: TBD) rack/setup/install drmrs non-cp-hosts - https://phabricator.wikimedia.org/T286507 (10RobH) a:03MMandere @mmandere is handling these installations, so I'm reassigning this to him. 
Once these hosts are installed and calling into puppet, please update and resolve this task. [17:01:02] 10ops-drmrs, 10DC-Ops: (Need By: TBD) rack/setup/install cp60[01-16] - https://phabricator.wikimedia.org/T286504 (10RobH) a:03MMandere @mmandere is handling these installations, so I'm reassigning this to him. Once these hosts are installed and calling into puppet, please update and resolve this task. [17:01:19] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32186/console" [puppet] - 10https://gerrit.wikimedia.org/r/737092 (owner: 10Vgutierrez) [17:03:02] (03PS5) 10Elukey: profile::kafka::mirror: add settings to support the migration to PKI [puppet] - 10https://gerrit.wikimedia.org/r/737091 (https://phabricator.wikimedia.org/T291905) [17:03:04] (03PS1) 10Elukey: sslcert::trusted_ca: check if the bundle .pem is defined [puppet] - 10https://gerrit.wikimedia.org/r/737095 (https://phabricator.wikimedia.org/T291905) [17:04:11] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32187/console" [puppet] - 10https://gerrit.wikimedia.org/r/737091 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [17:04:21] (03PS1) 10Jdlrobson: WikidataPageBanner should disable table of contents using public functions [extensions/WikidataPageBanner] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/737075 (https://phabricator.wikimedia.org/T295003) [17:06:14] (03CR) 10Btullis: "I see there is no linked ticket here." 
[puppet] - 10https://gerrit.wikimedia.org/r/737065 (owner: 10Elukey) [17:08:03] (03PS2) 10Majavah: hieradata: replace deployment-prep mwmaint to buster [puppet] - 10https://gerrit.wikimedia.org/r/737093 (https://phabricator.wikimedia.org/T278664) [17:09:12] (03CR) 10Elukey: [V: 03+1] profile::presto::server: use a stricter truststore (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/737065 (owner: 10Elukey) [17:10:16] (03PS4) 10Elukey: profile::presto::server: use a stricter truststore [puppet] - 10https://gerrit.wikimedia.org/r/737065 [17:10:27] (03CR) 10Elukey: profile::presto::server: use a stricter truststore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/737065 (owner: 10Elukey) [17:15:08] (03PS1) 10BBlack: Use FQDN in for drmrs switches in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/737096 (https://phabricator.wikimedia.org/T283050) [17:15:37] (03CR) 10Btullis: [C: 03+1] "Looks good to me. One totally optional nit." [puppet] - 10https://gerrit.wikimedia.org/r/736753 (https://phabricator.wikimedia.org/T292389) (owner: 10Majavah) [17:16:16] (03CR) 10BBlack: [C: 03+2] Use FQDN in for drmrs switches in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/737096 (https://phabricator.wikimedia.org/T283050) (owner: 10BBlack) [17:16:20] (03PS1) 10Elukey: Add comment about Druid data retention for webrequest_sampled_128 [puppet] - 10https://gerrit.wikimedia.org/r/737097 [17:20:00] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/737100 [17:20:51] (03PS2) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/737100 [17:22:25] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:22:46] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 186 probes of 638 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:22:53] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti6001.drmrs.wmnet with OS buster [17:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:03] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host ganeti6001.drmrs.wmnet with OS buster [17:23:47] 10SRE, 10Analytics-Radar, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10elukey) Finally some progress! Some notes: * We have now a generic define to create .p12/.jks truststores cont... [17:26:28] (03CR) 10BBlack: [C: 03+2] hieradata: replace deployment-prep mwmaint to buster [puppet] - 10https://gerrit.wikimedia.org/r/737093 (https://phabricator.wikimedia.org/T278664) (owner: 10Majavah) [17:27:58] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 45 probes of 638 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:28:42] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:18] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:33:58] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:43:50] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [17:48:11] (03PS1) 10Majavah: scap: use new pontoon logstash setup [puppet] - 10https://gerrit.wikimedia.org/r/737108 [17:50:37] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/737108 (owner: 10Majavah) [17:54:25] (03CR) 10AOkoth: "WIP!" [puppet] - 10https://gerrit.wikimedia.org/r/737064 (https://phabricator.wikimedia.org/T274463) (owner: 10AOkoth) [17:54:56] (03PS1) 10Mforns: analytics:refinery:job:refine_sanitize: Avoid monitor false alerts [puppet] - 10https://gerrit.wikimedia.org/r/737110 [17:55:08] (03CR) 10Joal: [C: 03+1] "Thanks Luca :)" [puppet] - 10https://gerrit.wikimedia.org/r/737097 (owner: 10Elukey) [17:57:16] (03CR) 10Joal: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/737110 (owner: 10Mforns) [18:01:01] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti6001.drmrs.wmnet with OS buster [18:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:10] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host ganeti6001.drmrs.wmnet with OS buster completed: - ganeti6001 (**... 
[18:01:53] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org [18:06:46] (03CR) 10Ahmon Dancy: [C: 03+1] scap: use new pontoon logstash setup [puppet] - 10https://gerrit.wikimedia.org/r/737108 (owner: 10Majavah) [18:06:52] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org [18:16:10] 10SRE, 10Wikimedia-Mailing-lists: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066 (10Legoktm) @joe and I discussed how to do this today. To recap, the goals of this are to work towards {T278495} (eliminating the special case IP) as well as being behind the caching l... [18:16:38] (03CR) 10Dzahn: [C: 03+2] wikitech::web: remove font packages from wikitech servers [puppet] - 10https://gerrit.wikimedia.org/r/735042 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [18:20:39] !log upgrading scap to 4.0.3 everywhere (T294966) [18:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:42] T294966: Deploy Scap version 4.0.3 - https://phabricator.wikimedia.org/T294966 [18:23:59] (03PS7) 10Dzahn: wikitech::web: remove font packages from wikitech servers [puppet] - 10https://gerrit.wikimedia.org/r/735042 (https://phabricator.wikimedia.org/T294378) [18:35:45] !log cr2-codfw> request chassis fpc online slot 0 - T294789 [18:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:48] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/32189/labweb1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/735042 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [18:41:03] !log removing mediawiki font packages from labweb* (wikitech wiki) [18:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:48] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 102, down: 2, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:53:24] (03CR) 10Herron: [C: 03+2] scap: use new pontoon logstash setup [puppet] - 10https://gerrit.wikimedia.org/r/737108 (owner: 10Majavah) [18:56:40] RECOVERY - BGP status on cr2-eqdfw is OK: BGP OK - up: 141, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:57:07] (03PS1) 10Cwhite: hiera: set prometheus nodes for cloud logging env [puppet] - 10https://gerrit.wikimedia.org/r/737115 [18:58:43] (03PS3) 10Dzahn: snapshot: convert 2 crons for full and partial dumps into systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/736599 (https://phabricator.wikimedia.org/T273673) [18:59:50] (03CR) 10Dzahn: "now using the new silent option" [puppet] - 10https://gerrit.wikimedia.org/r/736599 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [19:02:00] (03CR) 10Dzahn: "Yea.. hmm you are not the first to talk about new aliases instead. My intention though was to avoid adding on to the clutter. I'll leave t" [puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) (owner: 10Dzahn) [19:04:16] (03CR) 10Dzahn: [C: 03+2] "unlike the bigger reorg change this one seems to be uncontroversial enough to just go ahead" [puppet] - 10https://gerrit.wikimedia.org/r/736594 (https://phabricator.wikimedia.org/T294802) (owner: 10Dzahn) [19:06:47] (03CR) 10Dzahn: "deployed and double checked on cumin1001. 
before hit 19 hosts, now 23 with the 4 additional parsoid canaries" [puppet] - 10https://gerrit.wikimedia.org/r/736594 (https://phabricator.wikimedia.org/T294802) (owner: 10Dzahn) [19:12:49] 10SRE, 10Infrastructure-Foundations: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Dzahn) [19:12:51] 10SRE: try planet/people on bullseye / upgrade people.wikimedia.org backends to bullseye - https://phabricator.wikimedia.org/T280989 (10Dzahn) [19:13:22] (03PS1) 10Cwhite: beta: open up opensearch api port to labs in ferm [puppet] - 10https://gerrit.wikimedia.org/r/737116 [19:18:19] (03CR) 10Cwhite: [C: 03+2] beta: open up opensearch api port to labs in ferm [puppet] - 10https://gerrit.wikimedia.org/r/737116 (owner: 10Cwhite) [19:18:39] (03CR) 10Cwhite: [C: 03+2] hiera: set prometheus nodes for cloud logging env [puppet] - 10https://gerrit.wikimedia.org/r/737115 (owner: 10Cwhite) [19:23:14] Hello, is possible to reset 2fa on account? I'm logged in, but I've lost the device [19:24:56] You'd be best emailing trust & safety or creating a task [19:25:06] As I believe they mainly handle resets [19:25:37] (03PS1) 10Bartosz Dziewoński: ArticleTargetSaver: ve.init may be undefined [extensions/VisualEditor] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/737077 (https://phabricator.wikimedia.org/T294981) [19:25:52] (03PS3) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/737100 [19:26:19] k, thx RhinosF1! [19:44:07] ebernhardson: looks like checkuser wiki has no search index (anymore?) users reported on wikitech but maybe checkuser is special. saw that in some similar cases you fixed search for wikis by reindexing. is this one of those cases? 
https://phabricator.wikimedia.org/T295192 [19:57:06] 10SRE, 10Infrastructure-Foundations, 10netops: Move management routers ssh port - https://phabricator.wikimedia.org/T277438 (10RobH) [19:59:26] 10SRE, 10ops-eqsin: Decommission cr1-eqsin - https://phabricator.wikimedia.org/T256947 (10RobH) 05Open→03Stalled a:05RobH→03None So next steps are currently stalled, due to the fact we don't have a recycler setup for Singapore yet. We may just have Jin/DreamIIC handle disposal as well, not sure yet.... [19:59:58] 10SRE, 10ops-eqsin: Decommission cr1-eqsin - https://phabricator.wikimedia.org/T256947 (10RobH) We have Jin going out for the mr1-eqsin replacement. We may want to decide if we want DreamIIC to just dispose of all our old network kit. [20:04:00] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, and 2 others: wcqs1002 and wcqs2001 unresponsive - https://phabricator.wikimedia.org/T294865 (10Dzahn) As of right now both wcqs1002 and wcqs2001 seem to be running normal, blazegraph is active and all Icinga checks are green/OK. It's not obvious wh... [20:08:13] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, and 2 others: wcqs1002 and wcqs2001 unresponsive - https://phabricator.wikimedia.org/T294865 (10Dzahn) >>! In T294865#7486509, @Dzahn wrote: > It's not obvious what the issue was but it seems gone now. Oh, sorry, I saw T294961#7480793 and T294961#7... 
[20:09:40] !log rolling back 1.38.0-wmf.7 from all wikis due to UBN T295187 (refs T293948) [20:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:45] T293948: 1.38.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T293948 [20:09:45] T295187: Chinese conversion no longer work in the table of content - https://phabricator.wikimedia.org/T295187 [20:14:57] (03PS1) 10Dzahn: install_server: remove malmok.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/737120 (https://phabricator.wikimedia.org/T286480) [20:15:38] (03CR) 10jerkins-bot: [V: 04-1] install_server: remove malmok.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/737120 (https://phabricator.wikimedia.org/T286480) (owner: 10Dzahn) [20:17:19] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: Revert "all wikis to 1.38.0-wmf.7 refs T293948" [20:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:22] T293948: 1.38.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T293948 [20:17:32] 10SRE, 10Traffic, 10Patch-For-Review: Decomission malmok.wikimedia.org - https://phabricator.wikimedia.org/T286480 (10Dzahn) >>! In T286480#7221256, @ops-monitoring-bot wrote: > Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2) confirmed this is not in netbox and not in DNS rep... [20:18:27] (03PS2) 10Dzahn: install_server: remove malmok.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/737120 (https://phabricator.wikimedia.org/T286480) [20:19:42] (03CR) 10Ssingh: [C: 03+1] "Thanks Daniel and thank you malmok :)" [puppet] - 10https://gerrit.wikimedia.org/r/737120 (https://phabricator.wikimedia.org/T286480) (owner: 10Dzahn) [20:20:22] 10SRE, 10Traffic, 10Patch-For-Review: Decomission malmok.wikimedia.org - https://phabricator.wikimedia.org/T286480 (10Dzahn) I also don't see this host in debmonitor. It seems all done here besides the 2 entries in DHCP/installserver? 
[20:21:14] (03CR) 10Dzahn: [C: 03+2] "Thanks Sukhbir" [puppet] - 10https://gerrit.wikimedia.org/r/737120 (https://phabricator.wikimedia.org/T286480) (owner: 10Dzahn) [20:24:15] 10SRE, 10Traffic, 10Patch-For-Review: Decomission malmok.wikimedia.org - https://phabricator.wikimedia.org/T286480 (10ssingh) >>! In T286480#7486550, @Dzahn wrote: > I also don't see this host in debmonitor. It seems all done here besides the 2 entries in DHCP/installserver? I think that should be it. Last... [20:25:18] 10SRE, 10Traffic, 10Patch-For-Review: Decomission malmok.wikimedia.org - https://phabricator.wikimedia.org/T286480 (10Dzahn) 05Open→03Resolved ACK!:) Also not in Icinga, so it's gone from puppet db. [20:48:56] are the sites slow for anyone else, or just me? [20:50:14] WFM [20:50:50] Fine here [20:58:29] (03Abandoned) 10Ideophagous: updated arywiki namespaces as per T291737 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731290 (owner: 10Ideophagous) [21:02:16] 10ops-ulsfo, 10DC-Ops: ulsfo cable ids missing - https://phabricator.wikimedia.org/T295198 (10RobH) [21:02:39] 10ops-ulsfo, 10DC-Ops: ulsfo cable ids missing - https://phabricator.wikimedia.org/T295198 (10RobH) [21:06:34] (03CR) 10Krinkle: [C: 03+1] "No usage elsewhere, single-file, easy deploy indeed. LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734574 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [21:40:18] PROBLEM - Check systemd state on elastic1035 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search-psi-eqiad.service.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:40:57] ^ that's me typoing a restart, surprised it now thinks a service exists for that. 
Will try and clear [21:41:46] PROBLEM - Check systemd state on elastic1042 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search-psi-eqiad.service.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:44:30] RECOVERY - Check systemd state on elastic1035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:45:56] RECOVERY - Check systemd state on elastic1042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:19:25] !log rolling back 1.38.0-wmf.7 from group1 and group0 due to UBN T295187 (refs T293948) [22:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:30] T293948: 1.38.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T293948 [22:19:30] T295187: Chinese conversion no longer work in the table of content - https://phabricator.wikimedia.org/T295187 [22:21:34] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: Revert "group0/group1 to 1.38.0-wmf.7 refs T293948" [22:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:18] (03PS1) 10Dduvall: Revert "all wikis to 1.38.0-wmf.7 refs T293948" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737142 [22:24:20] (03CR) 10Dduvall: [C: 03+2] Revert "all wikis to 1.38.0-wmf.7 refs T293948" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737142 (owner: 10Dduvall) [22:24:22] (03PS1) 10Dduvall: Revert "group1 wikis to 1.38.0-wmf.7 refs T293948" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737143 [22:24:24] (03CR) 10Dduvall: [C: 03+2] Revert "group1 wikis to 1.38.0-wmf.7 refs T293948" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737143 (owner: 10Dduvall) [22:24:26] (03PS1) 10Dduvall: Revert "group0 wikis to 1.38.0-wmf.7 refs T293948" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737144 
[22:24:28] (03CR) 10Dduvall: [C: 03+2] Revert "group0 wikis to 1.38.0-wmf.7 refs T293948" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737144 (owner: 10Dduvall) [22:25:42] (03Merged) 10jenkins-bot: Revert "all wikis to 1.38.0-wmf.7 refs T293948" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737142 (owner: 10Dduvall) [22:25:44] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.38.0-wmf.7 refs T293948" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737143 (owner: 10Dduvall) [22:25:46] (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.38.0-wmf.7 refs T293948" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737144 (owner: 10Dduvall) [22:30:14] (03PS1) 10Dduvall: all wikis to 1.38.0-wmf.7 refs T293948 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737145 [22:30:16] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.38.0-wmf.7 refs T293948 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737145 (owner: 10Dduvall) [22:31:00] (03Merged) 10jenkins-bot: all wikis to 1.38.0-wmf.7 refs T293948 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737145 (owner: 10Dduvall) [22:31:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[22:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:13] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.7 refs T293948 [22:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:15] T293948: 1.38.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T293948 [22:32:25] !log re-rolling 1.38.0-wmf.7 to all wikis due to a better of two evil regressions UBN T295187 (refs T293948) [22:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:29] T295187: Chinese conversion no longer work in the table of content - https://phabricator.wikimedia.org/T295187 [22:35:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:31] 10SRE, 10ops-ulsfo: ps1-22-ulsfo Cord, Master_Cord_A, Active Power alerting - https://phabricator.wikimedia.org/T294891 (10RobH) 05Open→03Resolved Thanks! Also thank you for showing how to set threshholds via comment so I can fix in future, it is much appreciated! [22:48:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[23:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:52] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:18:40] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1193.68 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:19:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:20] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:53:26] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica