[00:00:05] RoanKattouw and Urbanecm: Dear deployers, time to do the UTC late backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211116T0000).
[00:00:05] nray and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:21] here o/
[00:01:13] o/
[00:01:14] Hey everyone
[00:01:17] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:01:33] tgr: want to do yours? :)
[00:02:03] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:02:14] (CR) Urbanecm: [C: +2] MobileWebUIActions tracks init event [extensions/WikimediaEvents] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/738399 (https://phabricator.wikimedia.org/T294738) (owner: Jdlrobson)
[00:02:20] (CR) Urbanecm: [C: +2] We need some way to distinguish namespaces [extensions/WikimediaEvents] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/739004 (https://phabricator.wikimedia.org/T294738) (owner: Nray)
[00:02:32] urbanecm: would you mind doing it? they just need the rebase
[00:02:44] tgr: sure thing :)
[00:04:08] Can't say whether the format of the config is correct though :)
[00:04:48] (CR) Urbanecm: [C: +2] [beta] Disable GrowthExperiments Add Link on all but enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/739023 (owner: Gergő Tisza)
[00:05:11] actually, the non-beta patch probably needs sync-file, since it will be read by the next branch, and it might not get automatically synced before then?
[00:05:19] Yup
[00:05:31] But should be no op for prod otherwise
[00:05:36] (Merged) jenkins-bot: [beta] Disable GrowthExperiments Add Link on all but enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/739023 (owner: Gergő Tisza)
[00:05:40] yeah, nothing using it right now
[00:06:04] (CR) Urbanecm: [C: +2] labs: Setup GEHomepageNewAccountVariantsByPlatform [mediawiki-config] - https://gerrit.wikimedia.org/r/738998 (https://phabricator.wikimedia.org/T294737) (owner: Kosta Harlan)
[00:06:09] (PS3) Urbanecm: labs: Setup GEHomepageNewAccountVariantsByPlatform [mediawiki-config] - https://gerrit.wikimedia.org/r/738998 (https://phabricator.wikimedia.org/T294737) (owner: Kosta Harlan)
[00:06:17] (CR) Urbanecm: [C: +2] labs: Setup GEHomepageNewAccountVariantsByPlatform [mediawiki-config] - https://gerrit.wikimedia.org/r/738998 (https://phabricator.wikimedia.org/T294737) (owner: Kosta Harlan)
[00:06:59] (Merged) jenkins-bot: MobileWebUIActions tracks init event [extensions/WikimediaEvents] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/738399 (https://phabricator.wikimedia.org/T294738) (owner: Jdlrobson)
[00:07:01] (Merged) jenkins-bot: We need some way to distinguish namespaces [extensions/WikimediaEvents] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/739004 (https://phabricator.wikimedia.org/T294738) (owner: Nray)
[00:07:04] (Merged) jenkins-bot: labs: Setup GEHomepageNewAccountVariantsByPlatform [mediawiki-config] - https://gerrit.wikimedia.org/r/738998 (https://phabricator.wikimedia.org/T294737) (owner: Kosta Harlan)
[00:07:37] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:07:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:25] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:11:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:55] the wmf backports were quicker than expected
[00:13:43] nray: pulled to mwdebug1001, can you test please?
[00:14:01] yes, thank you. are both of the patches on that server?
[00:14:07] affirmative
[00:14:10] cool, checking
[00:17:18] (PS3) Urbanecm: GrowthExperiments: Set up GEHomepageNewAccountVariantsByPlatform [mediawiki-config] - https://gerrit.wikimedia.org/r/738999 (https://phabricator.wikimedia.org/T294737) (owner: Kosta Harlan)
[00:17:22] (CR) Urbanecm: [C: +2] GrowthExperiments: Set up GEHomepageNewAccountVariantsByPlatform [mediawiki-config] - https://gerrit.wikimedia.org/r/738999 (https://phabricator.wikimedia.org/T294737) (owner: Kosta Harlan)
[00:18:17] (Merged) jenkins-bot: GrowthExperiments: Set up GEHomepageNewAccountVariantsByPlatform [mediawiki-config] - https://gerrit.wikimedia.org/r/738999 (https://phabricator.wikimedia.org/T294737) (owner: Kosta Harlan)
[00:19:40] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 50d9f2687cd11e6f838313a530c6bbd498d0b83e: GrowthExperiments: Set up GEHomepageNewAccountVariantsByPlatform (T294737) (duration: 00m 56s)
[00:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:44] T294737: Add an image: experiment - https://phabricator.wikimedia.org/T294737
[00:19:51] tgr: your patches should be merged/synced :)
[00:19:59] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:19:59] thanks urbanecm!
[00:20:25] @urbanecm you may proceed. I'll be monitoring our event logging graphs after you deploy to make sure we don't get extreme spikes
[00:20:28] I'll test in beta once the corresponding extension patch is merged
[00:21:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:07] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:22:14] nray: perfect, syncing
[00:23:44] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.7/extensions/WikimediaEvents/: 738399: 739004: WikimediaEvents backports (T294738) (duration: 00m 56s)
[00:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:48] T294738: Define and instrument bounce rate on talk pages - https://phabricator.wikimedia.org/T294738
[00:23:50] nray: and live
[00:23:52] anything else?
[00:24:02] thats it. Thank you!
[00:24:22] any time!
[00:25:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:33] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:27:42] (PS1) Legoktm: httpbb: Add some tests for thumbor [puppet] - https://gerrit.wikimedia.org/r/739025 (https://phabricator.wikimedia.org/T285477)
[00:28:52] !log UTC late window done
[00:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:53] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:35:15] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:36:59] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:38:09] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:38:21] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:39:29] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:09:41] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:16:07] (PS1) Gergő Tisza: GrowthExperiments configuration fixes [mediawiki-config] - https://gerrit.wikimedia.org/r/739032
[01:19:07] (CR) Urbanecm: [C: -1] "-labs looks good" [mediawiki-config] - https://gerrit.wikimedia.org/r/739032 (owner: Gergő Tisza)
[01:22:07] (PS1) Ebernhardson: Add CirrusSearch Old GC Hell alerting [alerts] - https://gerrit.wikimedia.org/r/739034 (https://phabricator.wikimedia.org/T290604)
[01:23:50] (CR) Gergő Tisza: GrowthExperiments configuration fixes (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/739032 (owner: Gergő Tisza)
[01:25:49] ^ I have a late followup to the deploy window.
[01:30:49] (PS2) Gergő Tisza: GrowthExperiments configuration fixes [mediawiki-config] - https://gerrit.wikimedia.org/r/739032 (https://phabricator.wikimedia.org/T294737)
[01:39:19] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:06:55] (PS1) TrainBranchBot: Branch commit for wmf/1.38.0-wmf.9 [core] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/739043
[02:06:57] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.38.0-wmf.9 [core] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/739043 (owner: TrainBranchBot)
[02:07:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:10:09] (CR) RLazarus: "Thanks for doing this!" [puppet] - https://gerrit.wikimedia.org/r/739025 (https://phabricator.wikimedia.org/T285477) (owner: Legoktm)
[02:10:37] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:10:39] (CR) RLazarus: httpbb: Add some tests for thumbor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/739025 (https://phabricator.wikimedia.org/T285477) (owner: Legoktm)
[02:10:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:21:59] (CR) jerkins-bot: [V: -1] Branch commit for wmf/1.38.0-wmf.9 [core] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/739043 (owner: TrainBranchBot)
[03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211116T0300)
[04:32:37] PROBLEM - ElasticSearch shard size check - 9200 on logstash1035 is CRITICAL: CRITICAL - logstash-mediawiki-2021.11.14(383.6666666666667gb) https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed
[05:20:27] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:51:21] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: cloudcephmon1001, cloudcephmon1003, cloudcontrol1005, cloudcontrol1003, cloudcephmon1002, cloudcontrol1004 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[06:27:37] (CR) Legoktm: httpbb: Add some tests for thumbor (2 comments) [puppet] - https://gerrit.wikimedia.org/r/739025 (https://phabricator.wikimedia.org/T285477) (owner: Legoktm)
[06:27:54] (PS2) Legoktm: httpbb: Add some tests for thumbor [puppet] - https://gerrit.wikimedia.org/r/739025 (https://phabricator.wikimedia.org/T285477)
[06:28:02] (CR) Legoktm: [C: -1] httpbb: Add some tests for thumbor [puppet] - https://gerrit.wikimedia.org/r/739025 (https://phabricator.wikimedia.org/T285477) (owner: Legoktm)
[06:34:15] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:34:45] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:10:28] SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install ml-train100[1-4] - https://phabricator.wikimedia.org/T291579 (elukey) @Jclark-ctr Hi! Before proceeding with the nodes do you mind to ping me or my team first? We are thinking of changing name to reflect the fact...
[07:19:54] (CR) Elukey: Configure stat servers to use /srv/spark-tmp as spark.local.dir (1 comment) [puppet] - https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346) (owner: Btullis)
[07:25:10] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6002.drmrs.wmnet with OS buster
[07:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:20] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6002.drmrs.wmnet with OS buster
[07:27:19] PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:27:35] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:27:45] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:32:09] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/6 UP : OSPFv3: 4/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:35:27] RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:35:41] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:35:51] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:36:13] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:36:46] (CR) Elukey: "I think this is a great step in the right direction, thanks a lot for working on it! Left some comments :)" [puppet] - https://gerrit.wikimedia.org/r/736753 (https://phabricator.wikimedia.org/T292389) (owner: Majavah)
[07:52:13] SRE, ops-drmrs: Degraded RAID on cp6002 - https://phabricator.wikimedia.org/T295747 (ops-monitoring-bot)
[08:02:16] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: cloudcephmon1003, cloudcephmon1001, cloudcontrol1005, cloudcephmon1002, cloudcontrol1003, cloudcontrol1004 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[08:04:16] (PS2) Muehlenhoff: admin: Remove access for jmixter [puppet] - https://gerrit.wikimedia.org/r/737864 (owner: Jbond)
[08:04:33] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6002.drmrs.wmnet with OS buster
[08:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:41] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6002.drmrs.wmnet with OS buster completed: - cp6002 (**WARN**)...
[08:05:00] (CR) jerkins-bot: [V: -1] admin: Remove access for jmixter [puppet] - https://gerrit.wikimedia.org/r/737864 (owner: Jbond)
[08:06:54] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:07:43] (PS3) Muehlenhoff: admin: Remove access for jmixter [puppet] - https://gerrit.wikimedia.org/r/737864 (owner: Jbond)
[08:10:36] (CR) Muehlenhoff: [C: +2] admin: Remove access for jmixter [puppet] - https://gerrit.wikimedia.org/r/737864 (owner: Jbond)
[08:13:08] (PS5) Muehlenhoff: Switch eqiad labsldapconfig to the read-only replicas [puppet] - https://gerrit.wikimedia.org/r/525220 (https://phabricator.wikimedia.org/T46722)
[08:14:41] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6003.drmrs.wmnet with OS buster
[08:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:51] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6003.drmrs.wmnet with OS buster
[08:18:28] (PS5) Ema: varnish: add varnishmtail-wrapper [puppet] - https://gerrit.wikimedia.org/r/738910 (https://phabricator.wikimedia.org/T293879)
[08:18:42] (CR) Ema: varnish: add varnishmtail-wrapper (1 comment) [puppet] - https://gerrit.wikimedia.org/r/738910 (https://phabricator.wikimedia.org/T293879) (owner: Ema)
[08:24:51] (PS4) Ideophagous: arywiki NS [mediawiki-config] - https://gerrit.wikimedia.org/r/738876 (https://phabricator.wikimedia.org/T291737)
[08:41:30] (CR) Ayounsi: [C: +2] _get_junos_router_interfaces: ignore VCP interfaces [software/homer/deploy] - https://gerrit.wikimedia.org/r/738905 (owner: Ayounsi)
[08:41:49] (CR) Ayounsi: [C: +2] test_interface_termination_names: add breakout cables support [software/netbox-extras] - https://gerrit.wikimedia.org/r/738913 (owner: Ayounsi)
[08:52:51] (CR) Mbch331: Add missing termbox codes from Wikibase (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: Mbch331)
[08:52:54] (PS3) Mbch331: Add missing termbox codes from Wikibase [mediawiki-config] - https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836)
[08:53:04] (CR) jerkins-bot: [V: -1] Add missing termbox codes from Wikibase [mediawiki-config] - https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: Mbch331)
[08:54:34] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6003.drmrs.wmnet with OS buster
[08:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:44] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6003.drmrs.wmnet with OS buster completed: - cp6003 (**WARN**)...
[09:05:29] (CR) Kosta Harlan: [C: -1] GrowthExperiments configuration fixes (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/739032 (https://phabricator.wikimedia.org/T294737) (owner: Gergő Tisza)
[09:09:38] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6004.drmrs.wmnet with OS buster
[09:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:48] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6004.drmrs.wmnet with OS buster
[09:15:09] SRE, Prod-Kubernetes, serviceops, Kubernetes: Helm chart dependencies no longer in requitements.yaml - https://phabricator.wikimedia.org/T295750 (JMeybohm)
[09:15:32] SRE, Prod-Kubernetes, serviceops, Kubernetes: Helm chart dependencies no longer in requirements.yaml - https://phabricator.wikimedia.org/T295750 (JMeybohm)
[09:18:15] (PS3) Gergő Tisza: GrowthExperiments configuration fixes [mediawiki-config] - https://gerrit.wikimedia.org/r/739032 (https://phabricator.wikimedia.org/T294737)
[09:19:29] (PS1) Majavah: Check for start npm script [software/tools-webservice] - https://gerrit.wikimedia.org/r/739111
[09:19:48] (CR) Gergő Tisza: GrowthExperiments configuration fixes (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/739032 (https://phabricator.wikimedia.org/T294737) (owner: Gergő Tisza)
[09:20:45] (CR) jerkins-bot: [V: -1] Check for start npm script [software/tools-webservice] - https://gerrit.wikimedia.org/r/739111 (owner: Majavah)
[09:33:24] (PS1) Jgiannelos: tile-pregeneration: Fix argument order for batching [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/739115
[09:38:47] (CR) Ayounsi: [V: +2 C: +2] _get_junos_router_interfaces: ignore VCP interfaces [software/homer/deploy] - https://gerrit.wikimedia.org/r/738905 (owner: Ayounsi)
[09:38:54] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 58, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:39:13] !log ayounsi@deploy1002 Started deploy [homer/deploy@c570af3]: Homer CR738905
[09:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:39] !log ayounsi@deploy1002 Finished deploy [homer/deploy@c570af3]: Homer CR738905 (duration: 01m 25s)
[09:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:46] (PS4) Btullis: Configure stat servers to use /srv/spark-tmp as spark.local.dir [puppet] - https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346)
[09:45:54] (CR) Arturo Borrero Gonzalez: [C: +1] ceph::auth::keyring: Generate keyring_path if not passed [puppet] - https://gerrit.wikimedia.org/r/738908 (https://phabricator.wikimedia.org/T293752) (owner: David Caro)
[09:46:01] (CR) Kosta Harlan: GrowthExperiments configuration fixes (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/739032 (https://phabricator.wikimedia.org/T294737) (owner: Gergő Tisza)
[09:46:54] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 60, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:46:55] (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32428/console" [puppet] - https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346) (owner: Btullis)
[09:47:08] (CR) Arturo Borrero Gonzalez: [C: +1] ceph::auth::keyring: allow passing the full client name [puppet] - https://gerrit.wikimedia.org/r/738903 (https://phabricator.wikimedia.org/T293752) (owner: David Caro)
[09:48:56] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6004.drmrs.wmnet with OS buster
[09:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:06] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6004.drmrs.wmnet with OS buster completed: - cp6004 (**WARN**)...
[09:50:44] (PS2) David Caro: ceph::auth::keyring: allow passing the full client name [puppet] - https://gerrit.wikimedia.org/r/738903 (https://phabricator.wikimedia.org/T293752)
[09:50:50] (PS2) David Caro: ceph::auth::keyring: Generate keyring_path if not passed [puppet] - https://gerrit.wikimedia.org/r/738908 (https://phabricator.wikimedia.org/T293752)
[09:51:02] (CR) Vgutierrez: [C: +1] varnish: add varnishmtail-wrapper [puppet] - https://gerrit.wikimedia.org/r/738910 (https://phabricator.wikimedia.org/T293879) (owner: Ema)
[09:51:07] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6005.drmrs.wmnet with OS buster
[09:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:16] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6005.drmrs.wmnet with OS buster
[09:52:57] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:52:57] (CR) David Caro: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32429/console" [puppet] - https://gerrit.wikimedia.org/r/738908 (https://phabricator.wikimedia.org/T293752) (owner: David Caro)
[09:53:30] SRE, vm-requests, Patch-For-Review: eqiad: 2 VMs for cloudbackup-dev - https://phabricator.wikimedia.org/T295584 (aborrero) OK! We don't really care about the OS drive size. What's important here is the extra drive for LVM, which should have at least 20G. You create the VMs or I do? I never did it b...
[09:54:05] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:54:08] (PS5) Btullis: Configure stat servers to use /srv/spark-tmp as spark.local.dir [puppet] - https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346)
[09:55:21] (PS1) Jgiannelos: tegola-vector-tiles: Disable debugging on codfw [deployment-charts] - https://gerrit.wikimedia.org/r/739118
[09:55:29] (CR) Btullis: [V: +1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32430/console" [puppet] - https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346) (owner: Btullis)
[09:57:27] (CR) Btullis: "PCC looks better this time." [puppet] - https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346) (owner: Btullis)
[09:58:00] (PS2) Arturo Borrero Gonzalez: cloud: introduce role for cloudbackup-dev [puppet] - https://gerrit.wikimedia.org/r/738376 (https://phabricator.wikimedia.org/T295584)
[09:58:50] (CR) Elukey: [C: +1] Configure stat servers to use /srv/spark-tmp as spark.local.dir [puppet] - https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346) (owner: Btullis)
[10:02:10] (CR) Btullis: [C: +1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/738270 (https://phabricator.wikimedia.org/T235299) (owner: Hnowlan)
[10:02:28] (CR) Btullis: [V: +1 C: +2] Configure stat servers to use /srv/spark-tmp as spark.local.dir [puppet] - https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346) (owner: Btullis)
[10:02:30] !log A:cp disable puppet to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/738910 on cp4021 T293879
[10:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:34] T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879
[10:02:57] (CR) Arturo Borrero Gonzalez: [C: +1] wmcs-srpeadcheck-tools: add new shorter webgrid names [puppet] - https://gerrit.wikimedia.org/r/731113 (https://phabricator.wikimedia.org/T292465) (owner: David Caro)
[10:03:25] (CR) Arturo Borrero Gonzalez: [C: +2] aptrepo: add k8s 1.21 to stretch too [puppet] - https://gerrit.wikimedia.org/r/738912 (https://phabricator.wikimedia.org/T282942) (owner: Majavah)
[10:04:52] (CR) Ema: [C: +2] varnish: add varnishmtail-wrapper [puppet] - https://gerrit.wikimedia.org/r/738910 (https://phabricator.wikimedia.org/T293879) (owner: Ema)
[10:06:18] !log updating deb packages on stretch-wikimedia/thirdparty/kubeadm-k8s-1-21 (T282942)
[10:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:23] T282942: Upgrade Toolforge Kubernetes to latest 1.21 - https://phabricator.wikimedia.org/T282942
[10:07:55] (CR) Arturo Borrero Gonzalez: [C: +1] WMCS haproxy: set expose-fd listeners for all services [puppet] - https://gerrit.wikimedia.org/r/737986 (owner: Andrew Bogott)
[10:08:01] (PS2) JMeybohm: Fix helm3 lint errors and helm dependencies [deployment-charts] - https://gerrit.wikimedia.org/r/738980
[10:08:03] (PS3) JMeybohm: Run helmfile commands against the local version of the chart [deployment-charts] - https://gerrit.wikimedia.org/r/738857
[10:08:05] (PS1) JMeybohm: Auto add helm chart repositories [deployment-charts] - https://gerrit.wikimedia.org/r/739122
[10:08:40] (PS1) Jgiannelos: tile-pregeneration: Make script less verbose [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/739123
[10:09:00] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ayounsi) @cmooney, I agree with your take on the security aspect. We're not in a typical service provider (ISP)/customer relations...
[10:10:33] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge::cronrunner: disable cron on non-active hosts [puppet] - https://gerrit.wikimedia.org/r/732986 (https://phabricator.wikimedia.org/T284767) (owner: Majavah)
[10:14:11] (CR) JMeybohm: [V: +2] Run helmfile commands against the local version of the chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/738857 (owner: JMeybohm)
[10:14:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:15:10] !log installing testvm2001
[10:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:27] (CR) JMeybohm: [C: -2] "I'm still not sure why this chain ends up producing a 100% diff for echostore, sessionstore and toolhub" [deployment-charts] - https://gerrit.wikimedia.org/r/739122 (owner: JMeybohm)
[10:20:00] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons.
[10:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:08] !log A:cp re-enable puppet after successful test on cp402[17] T293879 [10:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:11] T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 [10:24:42] (03CR) 10David Caro: [V: 03+1 C: 03+2] "All PCC changes were expected (only parameters, no actual resources)" [puppet] - 10https://gerrit.wikimedia.org/r/738908 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [10:24:48] (03CR) 10David Caro: [C: 03+2] ceph::auth::keyring: allow passing the full client name [puppet] - 10https://gerrit.wikimedia.org/r/738903 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [10:25:36] (03PS1) 10Kormat: db1112: Re-enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/739126 (https://phabricator.wikimedia.org/T294295) [10:26:11] 10SRE, 10vm-requests, 10Patch-For-Review: eqiad: 2 VMs for cloudbackup-dev - https://phabricator.wikimedia.org/T295584 (10MoritzMuehlenhoff) >>! In T295584#7505985, @aborrero wrote: > OK! We don't really care about the OS drive size. What's important here is the extra drive for LVM, which should have at leas... 
[10:26:33] (03CR) 10Kormat: [C: 03+2] db1112: Re-enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/739126 (https://phabricator.wikimedia.org/T294295) (owner: 10Kormat) [10:29:07] 10SRE, 10ops-drmrs, 10Traffic: Degraded RAID on cp6002 - https://phabricator.wikimedia.org/T295747 (10Peachey88) [10:30:59] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6005.drmrs.wmnet with OS buster [10:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:09] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6005.drmrs.wmnet with OS buster completed: - cp6005 (**WARN**)... [10:35:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10aborrero) From a certain point of view what we're doing here is validating [[https://wikitech.wikimedia.org/wiki/Cross-Realm_traffi... 
[10:39:42] (03PS1) 10David Caro: ceph::auth: require load_all when checking keyrings [puppet] - 10https://gerrit.wikimedia.org/r/739127 (https://phabricator.wikimedia.org/T293752) [10:40:38] PROBLEM - RPKI Validator RTR port on rpki2001 is CRITICAL: connect to address 10.192.0.103 and port 3323: Connection refused https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [10:41:16] (03CR) 10jerkins-bot: [V: 04-1] ceph::auth: require load_all when checking keyrings [puppet] - 10https://gerrit.wikimedia.org/r/739127 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [10:42:56] PROBLEM - Routinator process on rpki2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process [10:43:00] (03PS2) 10David Caro: ceph::auth: require load_all when checking keyrings [puppet] - 10https://gerrit.wikimedia.org/r/739127 (https://phabricator.wikimedia.org/T293752) [10:43:11] (03CR) 10Majavah: P::kerberos: automate principal management (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/736753 (https://phabricator.wikimedia.org/T292389) (owner: 10Majavah) [10:45:44] (03PS1) 10Hnowlan: api-gateway: disable debug logging outside of staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/739130 (https://phabricator.wikimedia.org/T295717) [10:48:55] (03PS3) 10David Caro: p:{osd,backup_glance_images,backy2}: require ceph::auth::deploy when checking keyrings [puppet] - 10https://gerrit.wikimedia.org/r/739127 (https://phabricator.wikimedia.org/T293752) [10:50:33] (03CR) 10jerkins-bot: [V: 04-1] p:{osd,backup_glance_images,backy2}: require ceph::auth::deploy when checking keyrings [puppet] - 10https://gerrit.wikimedia.org/r/739127 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [10:51:41] (03CR) 10Muehlenhoff: "Didn't find the time yet to read through it at large yet, but one comment line." 
[puppet] - 10https://gerrit.wikimedia.org/r/736753 (https://phabricator.wikimedia.org/T292389) (owner: 10Majavah) [10:52:41] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: allow mtail to match all handlers [puppet] - 10https://gerrit.wikimedia.org/r/738918 (owner: 10Giuseppe Lavagetto) [10:56:16] RECOVERY - RPKI Validator RTR port on rpki2001 is OK: TCP OK - 0.032 second response time on 10.192.0.103 port 3323 https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [10:59:36] (03PS4) 10David Caro: p:{osd,b_g_images,backy2}: require c::a::deploy when checking keyrings [puppet] - 10https://gerrit.wikimedia.org/r/739127 (https://phabricator.wikimedia.org/T293752) [10:59:36] PROBLEM - RPKI Validator RTR port on rpki2001 is CRITICAL: connect to address 10.192.0.103 and port 3323: Connection refused https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [11:01:29] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "There is an error in the monitoring check." [puppet] - 10https://gerrit.wikimedia.org/r/694625 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [11:01:33] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 3 NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32433/console" [puppet] - 10https://gerrit.wikimedia.org/r/739127 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:01:42] (03CR) 10Giuseppe Lavagetto: [C: 03+1] service/miscweb: switch state from service_setup to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/694628 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [11:03:53] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop analytics cluster: Restart of jvm daemons. 
[11:03:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [11:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:55] (03Abandoned) 10Awight: Remove deprecated QuickSurveys config fields [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604895 (owner: 10Awight) [11:07:01] (03CR) 10JMeybohm: [C: 03+2] Fix helm3 lint errors and helm dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/738980 (owner: 10JMeybohm) [11:07:33] (03CR) 10Awight: "I recommend we use "layout" everywhere and make it a mandatory field." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604681 (https://phabricator.wikimedia.org/T255130) (owner: 10Awight) [11:07:37] (03CR) 10David Caro: [V: 03+1 C: 03+2] "PCC looks good" [puppet] - 10https://gerrit.wikimedia.org/r/739127 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:07:56] (03CR) 10JMeybohm: Auto add helm chart repositories (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/739122 (owner: 10JMeybohm) [11:08:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [11:09:03] 10SRE, 10Scap, 10Release-Engineering-Team (Seen): find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470 (10hnowlan) This happens because of how DEPLOY_HEAD retains the last-used deploy server name and unless explicitly told to ignore, it w... 
[11:10:41] (03CR) 10Awight: "I learned that this class is adapted from https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/includes/SiteC" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737858 (owner: 10Thiemo Kreuz (WMDE)) [11:11:00] (03CR) 10Jbond: [C: 03+1] Update approver for gitlab-roots/vrts-roots [puppet] - 10https://gerrit.wikimedia.org/r/738836 (owner: 10Muehlenhoff) [11:12:09] (03Merged) 10jenkins-bot: Fix helm3 lint errors and helm dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/738980 (owner: 10JMeybohm) [11:13:47] (03PS10) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 [11:14:33] (03PS2) 10JMeybohm: Auto add helm chart repositories [deployment-charts] - 10https://gerrit.wikimedia.org/r/739122 [11:20:27] 10SRE, 10vm-requests, 10Patch-For-Review, 10cloud-services-team (Kanban): eqiad: 2 VMs for cloudbackup-dev - https://phabricator.wikimedia.org/T295584 (10aborrero) p:05Triage→03Medium a:03aborrero >>! In T295584#7506073, @MoritzMuehlenhoff wrote: >>>! In T295584#7505985, @aborrero wrote: >> OK! We do... [11:26:16] RECOVERY - Routinator process on rpki2001 is OK: PROCS OK: 1 process with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process [11:31:39] !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons. 
[11:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:24] PROBLEM - Routinator process on rpki2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process [11:33:46] (03CR) 10Majavah: "recheck" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/739111 (owner: 10Majavah) [11:34:04] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6006.drmrs.wmnet with OS buster [11:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:13] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6006.drmrs.wmnet with OS buster [11:36:03] (03PS1) 10Jbond: P:netbox::scripts: use role_hosts to get ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/739139 [11:36:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32434/console" [puppet] - 10https://gerrit.wikimedia.org/r/739139 (owner: 10Jbond) [11:40:35] (03PS3) 10Majavah: Check for start npm script [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/739111 [11:40:50] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "Autonyms look okay now, but the commonswiki part is still missing. (Also, needs a rebase apparently. 
I guess some more language codes were" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: 10Mbch331) [11:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [11:46:13] (03PS1) 10Hashar: ci: upgrade lintian-junit-report [puppet] - 10https://gerrit.wikimedia.org/r/739141 (https://phabricator.wikimedia.org/T295719) [11:46:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: introduce role for cloudbackup-dev [puppet] - 10https://gerrit.wikimedia.org/r/738376 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [11:47:02] (03CR) 10jerkins-bot: [V: 04-1] ci: upgrade lintian-junit-report [puppet] - 10https://gerrit.wikimedia.org/r/739141 (https://phabricator.wikimedia.org/T295719) (owner: 10Hashar) [11:47:51] 10SRE, 10Infrastructure-Foundations, 10netops: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (10cmooney) a:03cmooney So of course there is a complication. Currently we have a single BGP session between adjacent CR routers, peered over the loopback IPv4 addresses either si... [11:49:50] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:netbox::scripts: use role_hosts to get ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/739139 (owner: 10Jbond) [11:50:06] (03CR) 10Volans: [C: 03+2] "Tested on netbox-next on all devices." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/738274 (owner: 10Volans) [11:50:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [11:50:59] (03Merged) 10jenkins-bot: scripts: clean temporary code from PuppetDB import [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/738274 (owner: 10Volans) [11:51:06] (03PS2) 10Hashar: ci: upgrade lintian-junit-report [puppet] - 10https://gerrit.wikimedia.org/r/739141 (https://phabricator.wikimedia.org/T295719) [11:52:54] (03PS11) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 [11:54:32] (03PS3) 10Hashar: ci: upgrade lintian-junit-report [puppet] - 10https://gerrit.wikimedia.org/r/739141 (https://phabricator.wikimedia.org/T295719) [11:55:30] !log failover ganeti master in test cluster to ganeti-test2002 [11:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:11] RECOVERY - RPKI Validator RTR port on rpki2001 is OK: TCP OK - 0.034 second response time on 10.192.0.103 port 3323 https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [11:59:17] PROBLEM - RPKI Validator RTR port on rpki2001 is CRITICAL: connect to address 10.192.0.103 and port 3323: Connection refused https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211116T1200). [12:00:04] No Gerrit patches in the queue for this window AFAICS. 
[12:00:20] that’s good, because I’m off for lunch in a moment ^^ [12:00:42] (there’s a Wikibase-related config change in the pipeline but it needs a bit more work anyways) [12:02:21] (03PS4) 10Mbch331: Add missing termbox codes from Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) [12:03:38] * urbanecm waves anyway, in case a deployer's needed [12:05:46] (03PS1) 10Volans: scripts: allow to remove iface from VMs too [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/739178 [12:06:13] (03CR) 10Mbch331: "Now commons and Wikidata should be in sync and I've rebased the code." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: 10Mbch331) [12:13:32] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6006.drmrs.wmnet with OS buster [12:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:41] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6006.drmrs.wmnet with OS buster completed: - cp6006 (**WARN**)... 
[12:14:59] (03CR) 10DCausse: [C: 03+1] "lgtm," [alerts] - 10https://gerrit.wikimedia.org/r/739034 (https://phabricator.wikimedia.org/T290604) (owner: 10Ebernhardson) [12:17:11] PROBLEM - ganeti-wconfd running on ganeti-test2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [12:17:16] (03PS1) 10Jcrespo: dbbackups: Reorganize backups so we move s1 and s2 into dbprovX001 [puppet] - 10https://gerrit.wikimedia.org/r/739217 (https://phabricator.wikimedia.org/T280979) [12:19:47] (03PS3) 10Ema: varnish: move internal mtail scripts to another instance [puppet] - 10https://gerrit.wikimedia.org/r/738949 (https://phabricator.wikimedia.org/T293879) [12:20:23] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I agree with the idea, but implementation needs a bit of work." [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 (owner: 10JMeybohm) [12:20:47] (03CR) 10Hashar: "Cherry picked on integration-puppetmaster02 and confirmed to work." [puppet] - 10https://gerrit.wikimedia.org/r/739141 (https://phabricator.wikimedia.org/T295719) (owner: 10Hashar) [12:21:28] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/738949 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [12:22:25] 10SRE-tools, 10Infrastructure-Foundations, 10netops: Netbox - PuppetDB audit 2021-11 - https://phabricator.wikimedia.org/T295762 (10Volans) 05Open→03In progress p:05Triage→03Medium [12:22:51] !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid jvm daemons. 
[12:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:08] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [12:24:08] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [12:24:30] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6007.drmrs.wmnet with OS buster [12:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:40] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6007.drmrs.wmnet with OS buster [12:25:42] (03CR) 10Urbanecm: [C: 04-1] GrowthExperiments configuration fixes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739032 (https://phabricator.wikimedia.org/T294737) (owner: 10Gergő Tisza) [12:26:11] (03PS1) 10Arturo Borrero Gonzalez: openstack: nova: factorize libvirt secrets management [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) [12:26:35] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 (10MoritzMuehlenhoff) 05In progress→03Resolved The new Ganeti test cluster has been setup: It consists of three nodes in row A of codfw (ganeti-test200[1-3].codfw.wmnet). A test... 
[12:26:44] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova: factorize libvirt secrets management [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [12:29:41] (03CR) 10Urbanecm: [C: 04-1] "+, wgGENewcomerTasksLinkRecommendationsEnabled is no longer prefixed with a hyphen? Is that intentional?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739032 (https://phabricator.wikimedia.org/T294737) (owner: 10Gergő Tisza) [12:29:49] !log installing Linux 4.19.208 updates on buster hosts (no reboots) [12:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:07] 10SRE-tools, 10Analytics, 10Infrastructure-Foundations, 10netops: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10Volans) [12:31:21] 10SRE-tools, 10Analytics, 10Infrastructure-Foundations, 10netops: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10Volans) p:05Triage→03Medium [12:31:23] (03PS1) 10Urbanecm: [beta] Set wgGENewcomerTasksLinkRecommendationsEnabled to false everywhere but enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739225 [12:33:20] (03CR) 10Urbanecm: [C: 03+2] [beta] Set wgGENewcomerTasksLinkRecommendationsEnabled to false everywhere but enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739225 (owner: 10Urbanecm) [12:33:56] (03PS1) 10Ema: prometheus:ops: add varnishmtail-internal jobs [puppet] - 10https://gerrit.wikimedia.org/r/739227 (https://phabricator.wikimedia.org/T293879) [12:34:02] (03Merged) 10jenkins-bot: [beta] Set wgGENewcomerTasksLinkRecommendationsEnabled to false everywhere but enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739225 (owner: 10Urbanecm) [12:36:50] (03PS4) 10Ema: varnish: move internal mtail scripts to another instance [puppet] - 10https://gerrit.wikimedia.org/r/738949 
(https://phabricator.wikimedia.org/T293879) [12:36:52] (03PS2) 10Ema: prometheus:ops: add varnishmtail-internal jobs [puppet] - 10https://gerrit.wikimedia.org/r/739227 (https://phabricator.wikimedia.org/T293879) [12:37:36] (03PS12) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 [12:38:25] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/738949 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [12:38:36] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Remove PHP 7.3 images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/739006 (owner: 10Legoktm) [12:41:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:30] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:54:03] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SCherukuwada - https://phabricator.wikimedia.org/T295552 (10Jelto) [12:54:11] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for SCherukuwada - https://phabricator.wikimedia.org/T295550 (10Jelto) [12:54:18] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:57:29] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Reorganize backups so we move s1 and s2 into dbprovX001 [puppet] - 10https://gerrit.wikimedia.org/r/739217 (https://phabricator.wikimedia.org/T280979) (owner: 10Jcrespo) [12:57:37] (03PS1) 10Arturo Borrero Gonzalez: ceph: auth: introduce datatype for configuration hash [puppet] - 10https://gerrit.wikimedia.org/r/739228 (https://phabricator.wikimedia.org/T293752) [12:58:05] (03PS5) 10Ema: varnish: move internal mtail scripts to another instance [puppet] - 10https://gerrit.wikimedia.org/r/738949 (https://phabricator.wikimedia.org/T293879) [12:58:07] (03PS1) 10Ema: varnish: remove internal mtail scripts from default instance [puppet] - 10https://gerrit.wikimedia.org/r/739229 (https://phabricator.wikimedia.org/T293879) [12:58:56] 10SRE-Access-Requests: Requesting access to releasers-wikibase for Rosalie_WMDE - https://phabricator.wikimedia.org/T295765 (10WMDE-leszek) [12:59:54] 10SRE-Access-Requests: Requesting access to releasers-wikibase for Rosalie_WMDE - https://phabricator.wikimedia.org/T295765 (10WMDE-leszek) Looking at the previous request of this kind (T269777) I am unclear who should be requested to approve on WMF's end? @thcipriani ? Please advise, thank you. I approve this... 
[13:03:18] 10SRE-Access-Requests, 10Wikibase Release Strategy: Requesting access to releasers-wikibase for Rosalie_WMDE - https://phabricator.wikimedia.org/T295765 (10WMDE-leszek) [13:03:56] PROBLEM - Check systemd state on ping2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:04:33] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6007.drmrs.wmnet with OS buster [13:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:42] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6007.drmrs.wmnet with OS buster completed: - cp6007 (**WARN**)... [13:04:50] PROBLEM - Check systemd state on ping1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:03] (03PS2) 10Arturo Borrero Gonzalez: openstack: nova: factorize libvirt secrets management [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) [13:05:38] !log installing psmisc bugfix updates on buster hosts [13:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:43] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova: factorize libvirt secrets management [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:08:11] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.11 point update - https://phabricator.wikimedia.org/T292838 (10MoritzMuehlenhoff) [13:14:06] PROBLEM - Disk space on ping2001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=66%): /tmp 0 
MB (0% inode=66%): /var/tmp 0 MB (0% inode=66%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ping2001&var-datasource=codfw+prometheus/ops [13:15:21] (03PS2) 10Hnowlan: partmon: add reuse partmon profile for cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/738924 (https://phabricator.wikimedia.org/T295375) [13:18:30] 10SRE, 10Infrastructure-Foundations, 10netops: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767 (10MoritzMuehlenhoff) [13:18:33] (03PS3) 10Arturo Borrero Gonzalez: openstack: nova: factorize libvirt secrets management [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) [13:18:43] !log prune unused packages from ping1001/ping2001 T295767 [13:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:46] T295767: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767 [13:19:40] PROBLEM - Disk space on ping3001 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=66%): /tmp 1 MB (0% inode=66%): /var/tmp 1 MB (0% inode=66%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ping3001&var-datasource=esams+prometheus/ops [13:19:48] (03CR) 10David Caro: [C: 03+1] ceph: auth: introduce datatype for configuration hash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739228 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:20:28] RECOVERY - Check systemd state on ping1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:21:42] !log prune unused packages from ping3001 T295767 [13:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:47] !log installing debconf bugfix updates on buster [13:23:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:50] RECOVERY - Routinator process on rpki2001 is OK: PROCS OK: 1 process with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process [13:29:12] PROBLEM - Routinator process on rpki2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process [13:30:57] (03PS1) 10Jbond: (WIP) initial cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 [13:32:13] 10SRE-Access-Requests, 10Wikibase Release Strategy: Requesting access to releasers-wikibase for rosalie-WMDE - https://phabricator.wikimedia.org/T295765 (10WMDE-leszek) [13:32:38] XioNoX: topranks: fyi, routinator checks on rpki2001 seems to be flapping ^^ [13:32:47] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.11 point update - https://phabricator.wikimedia.org/T292838 (10MoritzMuehlenhoff) [13:33:02] majavah: thanks, looking now [13:33:21] (03CR) 10Lucas Werkmeister (WMDE): Add missing termbox codes from Wikibase (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: 10Mbch331) [13:34:02] RECOVERY - RPKI Validator RTR port on rpki2001 is OK: TCP OK - 0.032 second response time on 10.192.0.103 port 3323 https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [13:34:11] (03CR) 10jerkins-bot: [V: 04-1] (WIP) initial cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (owner: 10Jbond) [13:34:16] RECOVERY - Routinator process on rpki2001 is OK: PROCS OK: 1 process with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process [13:34:24] RECOVERY - Disk space on ping2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ping2001&var-datasource=codfw+prometheus/ops [13:34:49] "No space left on device" 
[13:35:06] Ironically I was going to rebuild it later today to add more space. [13:35:20] you jinxed it :-) [13:35:27] lol yep! [13:35:43] (03PS4) 10JMeybohm: Run helmfile commands against the local version of the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 [13:35:45] (03PS3) 10JMeybohm: Auto add helm chart repositories [deployment-charts] - 10https://gerrit.wikimedia.org/r/739122 [13:35:53] Anyway I'll ack the alert for now and then do just that, no point faffing about trying to free space on the existing one. [13:35:56] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.11 point update - https://phabricator.wikimedia.org/T292838 (10MoritzMuehlenhoff) [13:35:56] (03CR) 10JMeybohm: Run helmfile commands against the local version of the chart (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 (owner: 10JMeybohm) [13:36:02] majavah: thanks for the heads up :) [13:36:09] (03CR) 10jerkins-bot: [V: 04-1] Auto add helm chart repositories [deployment-charts] - 10https://gerrit.wikimedia.org/r/739122 (owner: 10JMeybohm) [13:36:15] (03CR) 10jerkins-bot: [V: 04-1] Run helmfile commands against the local version of the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 (owner: 10JMeybohm) [13:36:16] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.1 point update - https://phabricator.wikimedia.org/T292844 (10MoritzMuehlenhoff) [13:36:18] PROBLEM - Host cp2027 is DOWN: PING CRITICAL - Packet loss = 100% [13:38:02] ACKNOWLEDGEMENT - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw Cathal Mooney Ran out of disk, rebuilding with a bigger one. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:38:14] RECOVERY - Host cp2027 is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms [13:38:18] PROBLEM - RPKI Validator RTR port on rpki2001 is CRITICAL: connect to address 10.192.0.103 and port 3323: Connection refused https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [13:38:34] PROBLEM - Routinator process on rpki2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process [13:39:47] ACKNOWLEDGEMENT - RPKI Validator RTR port on rpki2001 is CRITICAL: connect to address 10.192.0.103 and port 3323: Connection refused Cathal Mooney Ran out of space, will rebuild with more. https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [13:40:02] ACKNOWLEDGEMENT - Routinator process on rpki2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name routinator Cathal Mooney Ran out of space, will rebuild with more. 
https://wikitech.wikimedia.org/wiki/RPKI%23Process [13:40:16] RECOVERY - Disk space on ping3001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ping3001&var-datasource=esams+prometheus/ops [13:42:10] (03PS7) 10Muehlenhoff: Obsolete role::restbase::base [puppet] - 10https://gerrit.wikimedia.org/r/729943 [13:48:36] (03PS2) 10Arturo Borrero Gonzalez: ceph: auth: introduce datatype for configuration hash [puppet] - 10https://gerrit.wikimedia.org/r/739228 (https://phabricator.wikimedia.org/T293752) [13:48:38] (03PS4) 10Arturo Borrero Gonzalez: openstack: nova: factorize libvirt secrets management [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) [13:48:40] (03PS1) 10Arturo Borrero Gonzalez: cloud: ceph: libvirt: migrate to new ceph auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/739235 (https://phabricator.wikimedia.org/T293752) [13:49:10] (03CR) 10Arturo Borrero Gonzalez: "WIP" [puppet] - 10https://gerrit.wikimedia.org/r/739235 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:49:56] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/729943 (owner: 10Muehlenhoff) [13:51:11] (03CR) 10jerkins-bot: [V: 04-1] cloud: ceph: libvirt: migrate to new ceph auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/739235 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:51:38] !log cmooney@cumin2002 START - Cookbook sre.ganeti.makevm for new host rpki2001.codfw.wmnet [13:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:26] RECOVERY - Check systemd state on ping2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:56] 10SRE, 10Tracking-Neverending: Cronspam from acmechief-test1001 - 
https://phabricator.wikimedia.org/T295770 (10Jelto) [13:54:31] 10SRE, 10Tracking-Neverending: Cronspam from acmechief-test1001 - https://phabricator.wikimedia.org/T295770 (10Jelto) p:05Triage→03Low [13:55:32] (03CR) 10Volans: (WIP) initial cookbook for syncing netbox puppet data (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (owner: 10Jbond) [13:57:30] (03PS1) 10Cathal Mooney: Update IP address for RPKI Validator session to rpki2001 [homer/public] - 10https://gerrit.wikimedia.org/r/739237 (https://phabricator.wikimedia.org/T292503) [13:58:57] !log cmooney@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host rpki2001.codfw.wmnet [13:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:11] (03PS13) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 [14:01:29] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP/WMF for JKieserman - https://phabricator.wikimedia.org/T295693 (10JKieserman) Hey Daniel, Yes sorry about that! I'm a software engineer on the abstract team, reporting to Cai Blanton. Let me know what other information would be useful! Cheers, Julia [14:05:22] 10SRE, 10SRE-tools, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: WMCS VIPs: Netbox netmask inconsistencies - https://phabricator.wikimedia.org/T295774 (10Volans) [14:05:25] (03PS14) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 [14:05:54] (03CR) 10Ayounsi: "The change itself lgtm, but the IP doesn't have a DNS record." 
[homer/public] - 10https://gerrit.wikimedia.org/r/739237 (https://phabricator.wikimedia.org/T292503) (owner: 10Cathal Mooney) [14:05:58] (03CR) 10David Caro: cloud: ceph: libvirt: migrate to new ceph auth abstraction (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/739235 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [14:06:33] (03CR) 10David Caro: [C: 03+1] ceph: auth: introduce datatype for configuration hash [puppet] - 10https://gerrit.wikimedia.org/r/739228 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [14:08:34] (03CR) 10Ayounsi: [C: 03+1] scripts: allow to remove iface from VMs too [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/739178 (owner: 10Volans) [14:08:53] (03PS5) 10Mbch331: Add missing termbox codes from Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) [14:09:19] !log cmooney@cumin2002 START - Cookbook sre.hosts.decommission for hosts rpki2001.codfw.wmnet [14:09:19] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (10Jelto) [14:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:22] (03PS8) 10Muehlenhoff: Obsolete role::restbase::base [puppet] - 10https://gerrit.wikimedia.org/r/729943 [14:11:42] (03CR) 10Jbond: [C: 03+1] scripts: allow to remove iface from VMs too [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/739178 (owner: 10Volans) [14:12:55] (03CR) 10Herron: [C: 03+1] logstash: reconstruct gitlab sidekiq message field [puppet] - 10https://gerrit.wikimedia.org/r/739018 (https://phabricator.wikimedia.org/T295731) (owner: 10Cwhite) [14:14:49] (03PS2) 10Jbond: (WIP) initial cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 [14:15:05] 10SRE, 10SRE-tools, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: cloudnet VLAN 
Netbox discrepancies - https://phabricator.wikimedia.org/T295776 (10Volans) [14:15:53] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/729943 (owner: 10Muehlenhoff) [14:17:54] (03CR) 10MSantos: [C: 03+2] tegola-vector-tiles: Disable debugging on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/739118 (owner: 10Jgiannelos) [14:18:00] !log cmooney@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts rpki2001.codfw.wmnet [14:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:12] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Rebuild Routinator (rpki) VMs with larger disk - https://phabricator.wikimedia.org/T292503 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cmooney@cumin2002 for hosts: `rpki2001.codfw.wmnet` - rpki2001.codfw.wmnet (**PAS... [14:18:20] (03CR) 10MSantos: [C: 03+2] tile-pregeneration: Make script less verbose [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/739123 (owner: 10Jgiannelos) [14:18:26] (03CR) 10jerkins-bot: [V: 04-1] (WIP) initial cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (owner: 10Jbond) [14:19:01] (03CR) 10Alexandros Kosiaris: [C: 03+1] Update approver for gitlab-roots/vrts-roots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738836 (owner: 10Muehlenhoff) [14:19:10] 10SRE, 10SRE-tools, 10Analytics, 10Infrastructure-Foundations, 10netops: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10elukey) The only recent thing that I recall is T276239, but not for all workers mentioned. I checked quickly the dry-run for... 
[14:19:24] (03Merged) 10jenkins-bot: tile-pregeneration: Make script less verbose [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/739123 (owner: 10Jgiannelos) [14:20:09] (03CR) 10MSantos: [C: 03+2] tile-pregeneration: Fix argument order for batching [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/739115 (owner: 10Jgiannelos) [14:21:40] (03Merged) 10jenkins-bot: tile-pregeneration: Fix argument order for batching [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/739115 (owner: 10Jgiannelos) [14:22:26] (03Merged) 10jenkins-bot: tegola-vector-tiles: Disable debugging on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/739118 (owner: 10Jgiannelos) [14:22:36] !log cmooney@cumin2002 START - Cookbook sre.ganeti.makevm for new host rpki2002.codfw.wmnet [14:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:57] (03PS2) 10Muehlenhoff: Update approver for os-installers [puppet] - 10https://gerrit.wikimedia.org/r/738837 [14:23:31] (03CR) 10Lucas Werkmeister (WMDE): Add missing termbox codes from Wikibase (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: 10Mbch331) [14:23:48] (03CR) 10Jbond: (WIP) initial cookbook for syncing netbox puppet data (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (owner: 10Jbond) [14:24:03] (03Abandoned) 10Cathal Mooney: Update IP address for RPKI Validator session to rpki2001 [homer/public] - 10https://gerrit.wikimedia.org/r/739237 (https://phabricator.wikimedia.org/T292503) (owner: 10Cathal Mooney) [14:24:49] !log re-adding backup user to db1108:analytics_meta T284150 [14:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:53] T284150: Refactor analytics-meta MariaDB layout to use an-db100[12] - https://phabricator.wikimedia.org/T284150 [14:25:55] (03PS1) 10Elukey: sre.druid.roll-restart-workers: restart Druid 
exporter [cookbooks] - 10https://gerrit.wikimedia.org/r/739240 [14:26:03] (03PS3) 10Jbond: (WIP) initial cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 [14:27:25] (03CR) 10Muehlenhoff: [C: 03+2] Update approver for os-installers [puppet] - 10https://gerrit.wikimedia.org/r/738837 (owner: 10Muehlenhoff) [14:27:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (10Jelto) >>! In T293253#7504047, @DAbad wrote: > Public Key: AAAAC3NzaC1lZDI1NTE5AAAAIEMCL89wONrqDKRSFKETmGNyQ5OCPlZWjDpYODpBXOMg Could you check your pasted ssh public key ag... [14:30:14] (03PS3) 10Muehlenhoff: Update approvers for gitlab-roots/vrts-roots [puppet] - 10https://gerrit.wikimedia.org/r/738836 [14:30:35] (03PS1) 10Jgiannelos: maps: Make silent cURL requests on tile invalidation [puppet] - 10https://gerrit.wikimedia.org/r/739241 [14:30:37] (03CR) 10Muehlenhoff: Update approvers for gitlab-roots/vrts-roots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738836 (owner: 10Muehlenhoff) [14:31:08] !log cmooney@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host rpki2002.codfw.wmnet [14:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:13] (03CR) 10Alexandros Kosiaris: [C: 03+1] Update approvers for gitlab-roots/vrts-roots [puppet] - 10https://gerrit.wikimedia.org/r/738836 (owner: 10Muehlenhoff) [14:31:23] (03CR) 10LSobanski: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/738836 (owner: 10Muehlenhoff) [14:31:35] !log cmooney@cumin2002 START - Cookbook sre.ganeti.makevm for new host rpki2001.codfw.wmnet [14:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:24] (03CR) 10Jgiannelos: "I checked the cronjob logs and the tile invalidation cURL requests are a bit noisy. This patch makes them silent and only show errors." 
[puppet] - 10https://gerrit.wikimedia.org/r/739241 (owner: 10Jgiannelos) [14:32:57] (03PS4) 10Muehlenhoff: Update approvers for gitlab-roots/vrts-roots [puppet] - 10https://gerrit.wikimedia.org/r/738836 [14:33:41] (03CR) 10Jgiannelos: "recheck" [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/739115 (owner: 10Jgiannelos) [14:34:33] (03CR) 10Muehlenhoff: [C: 03+2] Update approvers for gitlab-roots/vrts-roots [puppet] - 10https://gerrit.wikimedia.org/r/738836 (owner: 10Muehlenhoff) [14:39:36] 10SRE, 10Tracking-Neverending: Cronspam from acmechief-test1001 - https://phabricator.wikimedia.org/T295770 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez LE staging environment had a rough time :) It's fixed now [14:39:38] 10SRE, 10Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10Vgutierrez) [14:44:22] !log cmooney@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host rpki2001.codfw.wmnet [14:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:36] (03CR) 10Giuseppe Lavagetto: Run helmfile commands against the local version of the chart (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 (owner: 10JMeybohm) [14:47:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:50:19] (03PS1) 10Cathal Mooney: Updating MAC address in DHCP config for rpki2001 [puppet] - 10https://gerrit.wikimedia.org/r/739242 (https://phabricator.wikimedia.org/T292503) [14:51:06] (03CR) 10Cathal Mooney: [C: 03+2] Updating MAC address in DHCP config for rpki2001 [puppet] - 10https://gerrit.wikimedia.org/r/739242 (https://phabricator.wikimedia.org/T292503) (owner: 10Cathal Mooney) [14:52:10] (03PS1) 10Jbond: P:pki::client: manually deploy the root CA in cloud [puppet] - 10https://gerrit.wikimedia.org/r/739266 (https://phabricator.wikimedia.org/T291905) [14:52:18] (03CR) 10Volans: [C: 03+2] scripts: allow to remove iface from VMs too [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/739178 (owner: 10Volans) [14:52:43] (03CR) 10jerkins-bot: [V: 04-1] P:pki::client: manually deploy the root CA in cloud [puppet] - 10https://gerrit.wikimedia.org/r/739266 (https://phabricator.wikimedia.org/T291905) (owner: 10Jbond) [14:53:14] (03Merged) 10jenkins-bot: scripts: allow to remove iface from VMs too [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/739178 (owner: 10Volans) [14:56:01] (03PS2) 10Jbond: P:pki::client: manually deploy the root CA in cloud [puppet] - 10https://gerrit.wikimedia.org/r/739266 (https://phabricator.wikimedia.org/T291905) [14:56:54] (03PS1) 10Jgiannelos: tegola-vector-tiles: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/739267 [14:59:07] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10Joe) 05Open→03Resolved [14:59:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32438/console" [puppet] - 10https://gerrit.wikimedia.org/r/739266 
(https://phabricator.wikimedia.org/T291905) (owner: 10Jbond) [14:59:57] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::client: manually deploy the root CA in cloud [puppet] - 10https://gerrit.wikimedia.org/r/739266 (https://phabricator.wikimedia.org/T291905) (owner: 10Jbond) [15:04:39] 10SRE, 10Infrastructure-Foundations, 10netops: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (10ayounsi) For the sake of completeness, another option could be to add the fffff: IP to the loopback address, but that would be more of a workaround than a long term solution.... [15:06:12] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/739267 (owner: 10Jgiannelos) [15:09:05] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:10:07] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:10:29] (03Merged) 10jenkins-bot: tegola-vector-tiles: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/739267 (owner: 10Jgiannelos) [15:12:25] (03PS6) 10Mbch331: Add missing termbox codes from Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) [15:13:46] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Netbox - PuppetDB audit 2021-11 - https://phabricator.wikimedia.org/T295762 (10Volans) [15:13:54] 10SRE, 10SRE-tools, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: cloudnet VLAN Netbox discrepancies - https://phabricator.wikimedia.org/T295776 (10Volans) 05Open→03Resolved a:03Volans After verifying that the changes were all expected and the VLAN bits 
were actually an artifact of how... [15:14:23] (03CR) 10Mbch331: Add missing termbox codes from Wikibase (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: 10Mbch331) [15:15:55] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (10Jgreen) @ayounsi I think it would be fine to do the codfw pfw's this year. Please ping me on IRC when you have some time to discuss. [15:21:09] (03PS4) 10Vgutierrez: cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) [15:21:57] 10SRE, 10SRE-tools, 10Analytics, 10Infrastructure-Foundations, 10netops: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10BTullis) Yes I thought this was a bit odd. I saw there was a bit of re-imaging here: T231067#6891049 but that was before my t... [15:22:14] 10SRE, 10SRE-tools, 10Analytics, 10Data-Engineering, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10BTullis) a:03BTullis [15:22:41] (03CR) 10jerkins-bot: [V: 04-1] cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:24:15] (03PS1) 10Elukey: Revert "Update deployment-prep's profile::base::certificates settings" [puppet] - 10https://gerrit.wikimedia.org/r/739260 [15:24:32] (03Abandoned) 10Elukey: Revert "Update deployment-prep's profile::base::certificates settings" [puppet] - 10https://gerrit.wikimedia.org/r/739260 (owner: 10Elukey) [15:26:32] 10SRE, 10LDAP-Access-Requests: Access request to superset for user KSiebert - https://phabricator.wikimedia.org/T295777 (10RhinosF1) The linked patch does not seem related. Has this been copied from somewhere? 
[15:27:10] (03CR) 10MSantos: [C: 03+1] maps: Make silent cURL requests on tile invalidation [puppet] - 10https://gerrit.wikimedia.org/r/739241 (owner: 10Jgiannelos) [15:28:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] ceph: auth: introduce datatype for configuration hash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739228 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [15:30:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC noop https://puppet-compiler.wmflabs.org/compiler1002/32440/" [puppet] - 10https://gerrit.wikimedia.org/r/739228 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [15:32:49] PROBLEM - Host prometheus5001 is DOWN: PING CRITICAL - Packet loss = 100% [15:33:17] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (10Jelto) Hello Carol We received an access request from Desiree Abad to the group analytics-privatedata-users. Desiree wants to work on Analytics & Metrics Platform to service... 
[15:33:41] (03CR) 10Eigyan: Set up beta test environment for QuickSurveys (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [15:34:08] (03PS5) 10JMeybohm: Run helmfile commands against the local version of the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 [15:34:10] (03PS4) 10JMeybohm: Auto add helm chart repositories [deployment-charts] - 10https://gerrit.wikimedia.org/r/739122 [15:34:33] PROBLEM - Host doh5001 is DOWN: PING CRITICAL - Packet loss = 100% [15:34:35] PROBLEM - Host durum5001 is DOWN: PING CRITICAL - Packet loss = 100% [15:34:55] PROBLEM - Host ganeti5002 is DOWN: PING CRITICAL - Packet loss = 100% [15:34:57] oh oh [15:35:41] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:36:16] (03PS5) 10Vgutierrez: cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) [15:36:21] PROBLEM - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:36:47] 10SRE, 10LDAP-Access-Requests: Access request to superset for user KSiebert - https://phabricator.wikimedia.org/T295777 (10KSiebert) Yes @RhinosF1 I copied my teammates request because I realized that my permissions in superset are kind of limited and I don't understand how. [15:37:02] ^ does anyone know what's up in eqsin? [15:37:09] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:37:20] No just seen it. Netbox lists ganeti5002 as "failed" I see. 
[15:37:22] (03CR) 10jerkins-bot: [V: 04-1] Run helmfile commands against the local version of the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 (owner: 10JMeybohm) [15:37:44] (yeah, that would explain the doh and durum hosts) [15:37:46] My assumption was that server failed and took the VMs with it [15:37:48] (03CR) 10jerkins-bot: [V: 04-1] cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:37:58] But not sure, Netbox wouldn't be magically aware of a random failure. [15:38:19] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:38:20] oh yeah so then the BGP/BFD alerts might be related to the doh and durum hosts being down [15:38:39] I expect so [15:38:50] 10SRE, 10LDAP-Access-Requests: Access request to superset for user KSiebert - https://phabricator.wikimedia.org/T295777 (10RhinosF1) No problem, whoever is on clinic duty this week will assist and add you. That should be @Jelto. [15:38:51] That host is up on it's iDRAC / dedicated management interface anyway [15:38:56] cmooney@cumin2002:~$ ping ganeti5002.mgmt.eqsin.wmnet [15:38:56] PING ganeti5002.mgmt.eqsin.wmnet (10.132.129.114) 56(84) bytes of data. 
[15:38:56] 64 bytes from ganeti5002.mgmt.eqsin.wmnet (10.132.129.114): icmp_seq=1 ttl=61 time=218 ms [15:38:56] 64 bytes from wmf7194.mgmt.eqsin.wmnet (10.132.129.114): icmp_seq=2 ttl=61 time=218 ms [15:39:31] 10SRE, 10LDAP-Access-Requests: Access request to superset for user KSiebert - https://phabricator.wikimedia.org/T295777 (10RhinosF1) [15:40:14] 10SRE, 10LDAP-Access-Requests: Access request to superset for user KSiebert - https://phabricator.wikimedia.org/T295777 (10RhinosF1) [15:40:35] Certainly looks down, from ganeti5001 it has no stats for it, but ganeti5002 does appear to be in the cluster [15:40:41] https://www.irccloud.com/pastebin/1GspDp5R/ [15:41:08] ah [15:41:12] ineed [15:41:16] indeed even :) [15:43:42] ineed my VMs to work says sukhe :D [15:43:43] (03CR) 10Lucas Werkmeister (WMDE): Add missing termbox codes from Wikibase (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: 10Mbch331) [15:44:04] topranks: :D [15:44:08] I can't get onto the iDRAC interface of that box from bast5001, but unsure if that should be possible (may be blocked by fw). [15:44:50] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Christoph Jauera - https://phabricator.wikimedia.org/T295781 (10WMDE-Fisch) [15:44:56] (03CR) 10Ahmon Dancy: [C: 03+1] ci: upgrade lintian-junit-report [puppet] - 10https://gerrit.wikimedia.org/r/739141 (https://phabricator.wikimedia.org/T295719) (owner: 10Hashar) [15:45:30] sukhe: so it looks to me like the host has failed. [15:45:41] (03PS6) 10Vgutierrez: cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) [15:46:02] And I expect we probably take 2 parallel actions - re-deploy any missing VMs to the other Ganeti hosts, and work on getting the server back running / replaced. 
[15:46:28] moritzm, XioNoX: Does that sound right (sry don't know who to pick on here) [15:46:30] (03PS5) 10JMeybohm: Auto add helm chart repositories [deployment-charts] - 10https://gerrit.wikimedia.org/r/739122 [15:46:32] (03PS6) 10JMeybohm: Run helmfile commands against the local version of the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 [15:47:11] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Christoph Jauera - https://phabricator.wikimedia.org/T295781 (10Tobi_WMDE_SW) Approving that @WMDE-Fisch is in my team and needs the access for the stated reasons. Would be awesome if it could be granted. [15:47:14] (03CR) 10jerkins-bot: [V: 04-1] cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:47:41] Looks like only VMs that were on it are the ones that have alerted [15:47:49] https://www.irccloud.com/pastebin/YXctu6pD/ [15:48:02] (03CR) 10JMeybohm: "I did also change order of commits so that CI should no longer fail...maybe" [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 (owner: 10JMeybohm) [15:48:07] (03CR) 10jerkins-bot: [V: 04-1] Run helmfile commands against the local version of the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 (owner: 10JMeybohm) [15:48:21] well...that went nicely :) [15:48:28] (03PS7) 10Vgutierrez: cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) [15:49:37] (03PS7) 10JMeybohm: Run helmfile commands against the local version of the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 [15:49:47] topranks: I'd say ask the service owners, o11y for prometheus, sukhe for the other 2, and file a high priority task for DCops [15:49:59] ok thanks for the advice [15:50:02] (03CR) 10jerkins-bot: [V: 04-1] cache:haproxy: Gather TTFB metrics 
using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:50:17] not my day... [15:50:21] but I'd guess it's going to be a yes on rebuilding the instances :) [15:50:48] vgutierrez: it's more your day than ganeti5002's day [15:50:50] :) [15:50:59] sukhe: see the advice above, sounds like rebuilding the VMs is the way to go here. [15:51:43] robh: ^ warning, incoming high priority task about a failed ganeti server in eqsin [15:52:24] (03PS1) 10Jbond: cookbook sre.idm.u2f: add cookbook to enable/disable u2f [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) [15:52:51] hello, back [15:53:02] !log merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/525220 which makes read-only ldap the default for ldap clients [15:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:14] (03CR) 10Muehlenhoff: [C: 03+2] Switch eqiad labsldapconfig to the read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/525220 (https://phabricator.wikimedia.org/T46722) (owner: 10Muehlenhoff) [15:53:26] lmata: fyi, prometheus5001 is dead (See above) [15:53:38] so given that the doh and durum are anycasted, I am not worried about failing requests or anything (thankfully!) but yeah [15:53:48] XioNoX: so out of curiosity, what happened here? [15:54:00] thanks XioNoX will look into it [15:54:14] (thanks topranks and XioNoX btw) [15:54:25] sukhe: I'm not sure, the iDRAC management of that server responds to pings, the main IP that the Debian OS is using is not responding. [15:54:35] Could be anything from a hardware failure to a kernel panic or something. [15:55:01] I couldn't reach the iDRAC virtual console, but I only tried from bast5001, not sure if the connection is allowed from there. [15:55:14] DC Ops can investigate and advise, but the host is dead right now. 
[15:55:19] topranks: idrac over https can only be reached from the cumin hosts, ssh from bast as well (iirc) [15:55:26] (03CR) 10jerkins-bot: [V: 04-1] cookbook sre.idm.u2f: add cookbook to enable/disable u2f [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [15:57:07] topranks: let me know if you need any help [15:58:43] I'm moving instances off ganeti5002 now [15:59:36] moritzm: out of curiosity, does that mean re-building the instances, or the storage is duplicated on the other nodes as well? [16:00:16] we'd move to the secondary instances, was there any luck with ganeti5002's mgmt? [16:00:24] It's up to ping. [16:00:36] do we get anything in the SEL? [16:01:34] moritzm: robh is looking into it in -dcops [16:01:49] last event is from Oct about a power blip [16:02:26] it's oopsing [16:02:38] I get a root login prompt [16:02:43] somewhere in KVM [16:02:49] I'm going to powercycle [16:03:07] +1 [16:03:09] it's ooming [16:03:21] moritzm: please sync up with robh as well [16:03:30] I think he is about to reboot it as well [16:03:31] Ok [16:03:39] it's at a login prompt on the virtual console [16:04:12] https://usercontent.irccloud-cdn.com/file/mVlyFjDt/image.png [16:04:28] topranks: try to login and you'll see the oom [16:04:31] 10SRE, 10ops-eqsin, 10DC-Ops: Failed host: ganetti5002 - https://phabricator.wikimedia.org/T295783 (10RhinosF1) [16:04:49] 10SRE, 10ops-eqsin, 10DC-Ops: Failed host: ganeti5002 - https://phabricator.wikimedia.org/T295783 (10RhinosF1) [16:05:02] 16:04:54 up 285 days, 1:37, 1 user, load average: 65.44, 63.86, 56.40 [16:05:08] !log powercycling ganeti5002 [16:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:28] (03CR) 10Eigyan: [C: 03+1] Set up beta test environment for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [16:06:30] just occurred to me - oom = 
"out of memory" [16:06:49] console was full of stack traces [16:06:52] ok thanks volans, reboot will reset that anyway. [16:07:01] rebooting bios now [16:07:12] 10SRE, 10SRE-tools, 10Analytics, 10Data-Engineering, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10BTullis) ==an-worker1104== ====Current interfaces snapshot: {F34750374,width=600} ====Current interfaces: * eno1 - SFTP+ - connected... [16:07:12] ganeti memory leak bug or something? [16:07:27] robh: I'm off the console if you need it [16:09:44] (03PS9) 10Jhernandez: Set up beta test environment for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [16:10:01] topranks: I think some kernel bug [16:10:08] ok [16:10:31] RECOVERY - Host ganeti5002 is UP: PING OK - Packet loss = 0%, RTA = 248.27 ms [16:10:48] \o/ [16:10:49] hm! [16:10:54] (03PS8) 10Vgutierrez: cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) [16:11:27] due to the powercycle we now have a more recent kernel as well (4.19.208 over 4.19.171), but if we're lucky whatever hit it is already backported into 4.19.208 :-) [16:12:51] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[16:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:29] RECOVERY - Host prometheus5001 is UP: PING OK - Packet loss = 0%, RTA = 247.37 ms [16:15:17] RECOVERY - Host doh5001 is UP: PING OK - Packet loss = 0%, RTA = 247.83 ms [16:15:35] RECOVERY - Host durum5001 is UP: PING OK - Packet loss = 0%, RTA = 248.56 ms [16:15:59] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Run helmfile commands against the local version of the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 (owner: 10JMeybohm) [16:16:21] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32441/console" [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [16:17:11] PROBLEM - Check systemd state on prometheus5001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:17:57] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:18:31] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 73, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:19:07] RECOVERY - BFD status on cr3-eqsin is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:19:09] RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 327, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:19:17] PROBLEM - Check systemd state on durum5001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:20:58] 10SRE, 10ops-eqsin, 10DC-Ops: Failed host: ganeti5002 - https://phabricator.wikimedia.org/T295783 (10MoritzMuehlenhoff) 05Open→03Resolved 
I powercycled the server over the mgmt and it came back up fine. Closing since there's no fixable hardware issue. As part of the reboot (coincidentally I had rolled ou... [16:21:27] RECOVERY - Check systemd state on prometheus5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:50] (03CR) 10Cwhite: [C: 03+1] "Thanks for this!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/739130 (https://phabricator.wikimedia.org/T295717) (owner: 10Hnowlan) [16:22:27] !log systemctl reset-failed ifup@esn13 on durum5001 after restart T273026 [16:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:31] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [16:23:33] RECOVERY - Check systemd state on durum5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:23:34] !log systemctl reset-failed ifup@ens13 on prometheus5001 T273026 [16:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:56] (03CR) 10Cwhite: [C: 03+2] logstash: reconstruct gitlab sidekiq message field [puppet] - 10https://gerrit.wikimedia.org/r/739018 (https://phabricator.wikimedia.org/T295731) (owner: 10Cwhite) [16:23:58] (03CR) 10Volans: cookbook sre.idm.u2f: add cookbook to enable/disable u2f (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [16:27:28] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[16:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:29] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 11): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32442/console" [puppet] - 10https://gerrit.wikimedia.org/r/738919 (owner: 10Giuseppe Lavagetto) [16:35:29] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] "The change is a noop everywhere, so at the very least it's not harmful to merge." [puppet] - 10https://gerrit.wikimedia.org/r/738919 (owner: 10Giuseppe Lavagetto) [16:37:15] (03CR) 10Giuseppe Lavagetto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/725331 (https://phabricator.wikimedia.org/T292291) (owner: 10BBlack) [16:42:20] (03PS2) 10Dzahn: remove scholarships.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/738028 (https://phabricator.wikimedia.org/T243037) [16:43:11] (03PS1) 10Jbond: apereo_cas: add cas_u2f script [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) [16:43:44] (03CR) 10Dave Pifke: [C: 03+1] "Looks good now, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/737970 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [16:47:09] (03PS1) 10Ahmon Dancy: Added docker::resource_monitor class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) [16:47:41] (03CR) 10jerkins-bot: [V: 04-1] Added docker::resource_monitor class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy) [16:48:50] (03PS2) 10Ahmon Dancy: Added docker::resource_monitor class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) [16:50:21] (03CR) 10Volans: "LGTM, couple of minor nits inline" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 (owner: 10Jbond) [16:52:52] (03PS3) 10Ahmon Dancy: Added docker::resource_monitor class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) [16:56:42] !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[16:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:57:51] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 100 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [16:58:56] (03CR) 10Volans: [C: 03+1] "Although I'm not familiar with the underlying DB, LGTM, nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [16:59:51] RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [17:00:04] jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211116T1700). [17:00:04] No Gerrit patches in the queue for this window AFAICS. [17:00:30] puppet window complete ✅ [17:01:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:02:07] 10SRE, 10Scap, 10Release-Engineering-Team (Seen): find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470 (10dancy) >>! In T197470#7506161, @hnowlan wrote: > For the immediate term if there are no objections I will replace all instances of `... 
[17:02:10] (03CR) 10Majavah: P::kerberos: automate principal management (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736753 (https://phabricator.wikimedia.org/T292389) (owner: 10Majavah) [17:02:40] 10SRE, 10Scap, 10Release-Engineering-Team (Priority Backlog 📥): find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470 (10dancy) [17:08:48] 10SRE, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10Papaul) @MoritzMuehlenhoff @RobH the 2 ganeti nodes are we racking them in a 10G rack or 1G? "Networking/Subnet/VLAN/IP: 10G, same VLAN/IP setup as existing Ganeti serv... [17:11:17] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:40] (03PS1) 10Majavah: acme_chief: add -rw to ldap certs [puppet] - 10https://gerrit.wikimedia.org/r/739283 (https://phabricator.wikimedia.org/T295150) [17:18:22] (03CR) 10Dzahn: [C: 03+2] "removed from ATS yesterday, nothing in apache access logs" [dns] - 10https://gerrit.wikimedia.org/r/738028 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [17:18:32] (03CR) 10Muehlenhoff: apereo_cas: add cas_u2f script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [17:19:37] (03PS1) 10Majavah: wikimedia.org: add ldap-rw to replace ldap-labs [dns] - 10https://gerrit.wikimedia.org/r/739284 (https://phabricator.wikimedia.org/T295150) [17:19:48] (03PS2) 10Majavah: wikimedia.org: add ldap-rw to replace ldap-labs [dns] - 10https://gerrit.wikimedia.org/r/739284 (https://phabricator.wikimedia.org/T295150) [17:20:02] !log removing scholarships.wikimedia.org 
from DNS - T243037 [17:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:05] T243037: Shutdown scholarships.wikimedia.org and archive project - https://phabricator.wikimedia.org/T243037 [17:20:11] 10SRE, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10MoritzMuehlenhoff) >>! In T294139#7507224, @Papaul wrote: > @MoritzMuehlenhoff @RobH the 2 ganeti nodes are we racking them in a 10G rack or 1G? If there's sufficient s... [17:22:04] (03CR) 10Jeena Huneidi: "recheck" [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739043 (owner: 10TrainBranchBot) [17:22:12] (03PS1) 10AOkoth: gitlab: re-enable restore timer [puppet] - 10https://gerrit.wikimedia.org/r/739306 (https://phabricator.wikimedia.org/T294580) [17:22:27] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm but because we already merged one ldap hostname patch today I'm going to wait a bit before merging this one." 
[dns] - 10https://gerrit.wikimedia.org/r/739284 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah) [17:23:10] (03CR) 10Jeena Huneidi: [C: 03+2] Branch commit for wmf/1.38.0-wmf.9 [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739043 (owner: 10TrainBranchBot) [17:23:35] (03PS5) 10Arturo Borrero Gonzalez: openstack: nova: factorize libvirt secrets management [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) [17:24:48] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/compiler1002/32443/" [puppet] - 10https://gerrit.wikimedia.org/r/739306 (https://phabricator.wikimedia.org/T294580) (owner: 10AOkoth) [17:25:15] (03CR) 10Dzahn: [C: 03+1] gitlab: re-enable restore timer [puppet] - 10https://gerrit.wikimedia.org/r/739306 (https://phabricator.wikimedia.org/T294580) (owner: 10AOkoth) [17:25:49] (03PS2) 10Jbond: apereo_cas: add cas_u2f script [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) [17:25:57] 10SRE, 10SRE-Access-Requests, 10Wikibase Release Strategy, 10Wikidata, 10wdwb-tech: Requesting access to releasers-wikibase for rosalie-WMDE - https://phabricator.wikimedia.org/T295765 (10thcipriani) >>! In T295765#7506451, @WMDE-leszek wrote: > Looking at the previous requests of this kind (T269777, T28... 
[17:26:02] 10SRE, 10LDAP-Access-Requests, 10Security-Team: Add Manfredi Martorana to wmf ldap group - https://phabricator.wikimedia.org/T295789 (10sbassett) [17:26:20] 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Manfredi Martorana to wmf ldap group - https://phabricator.wikimedia.org/T295789 (10sbassett) [17:26:54] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: add cas_u2f script [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [17:29:55] (03PS1) 10Btullis: Remove override for spark.local.dir on stat100x servers [puppet] - 10https://gerrit.wikimedia.org/r/739307 (https://phabricator.wikimedia.org/T295346) [17:30:14] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Manfredi Martorana to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T295790 (10sbassett) [17:31:07] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32444/console" [puppet] - 10https://gerrit.wikimedia.org/r/739307 (https://phabricator.wikimedia.org/T295346) (owner: 10Btullis) [17:31:24] 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Manfredi Martorana to wmf ldap group - https://phabricator.wikimedia.org/T295789 (10sbassett) [17:31:27] (03CR) 10Btullis: [V: 03+1 C: 03+2] Remove override for spark.local.dir on stat100x servers [puppet] - 10https://gerrit.wikimedia.org/r/739307 (https://phabricator.wikimedia.org/T295346) (owner: 10Btullis) [17:36:35] (03PS3) 10Jbond: apereo_cas: add cas_u2f script [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) [17:38:20] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10Papaul) I downgrade Junos on QFX5100 at https://netbox.wikimedia.org/dcim/rack-elevations/ and did a request system zeroize 
on it . This is the one we will be using to repl... [17:38:44] (03CR) 10AOkoth: [C: 03+2] gitlab: re-enable restore timer [puppet] - 10https://gerrit.wikimedia.org/r/739306 (https://phabricator.wikimedia.org/T294580) (owner: 10AOkoth) [17:41:08] (03CR) 10Hnowlan: [C: 03+2] api-gateway: disable debug logging outside of staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/739130 (https://phabricator.wikimedia.org/T295717) (owner: 10Hnowlan) [17:42:17] (03Merged) 10jenkins-bot: Branch commit for wmf/1.38.0-wmf.9 [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739043 (owner: 10TrainBranchBot) [17:44:47] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10serviceops: Support services VIPs with not marked as VIP in Netbox - https://phabricator.wikimedia.org/T295793 (10Volans) [17:45:39] (03Merged) 10jenkins-bot: api-gateway: disable debug logging outside of staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/739130 (https://phabricator.wikimedia.org/T295717) (owner: 10Hnowlan) [17:46:24] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10ayounsi) That works for me, thanks, can you send a calendar invite? Note that the link in your comment doesn't point to any specific device. [17:47:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:14] (03PS1) 10Hnowlan: api-gateway: Bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/739308 (https://phabricator.wikimedia.org/T295717) [17:49:56] (03PS2) 10Jbond: cookbook sre.idm.u2f: add cookbook to enable/disable u2f [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) [17:50:02] (03CR) 10Jbond: "updated thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [17:51:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:02] (03PS3) 10Jbond: cookbook sre.idm.u2f: add cookbook to enable/disable u2f [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) [17:53:28] (03CR) 10Hnowlan: [C: 03+2] api-gateway: Bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/739308 (https://phabricator.wikimedia.org/T295717) (owner: 10Hnowlan) [17:54:22] (03PS4) 10Jbond: apereo_cas: add cas_u2f script [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) [17:54:44] (03CR) 10jerkins-bot: [V: 04-1] cookbook sre.idm.u2f: add cookbook to enable/disable u2f [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [17:58:03] (03Merged) 10jenkins-bot: api-gateway: Bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/739308 (https://phabricator.wikimedia.org/T295717) (owner: 10Hnowlan) [17:58:14] (03PS1) 10Volans: netbox - cas: allow users with active=False [software/netbox] - 10https://gerrit.wikimedia.org/r/739309 [17:58:22] (03CR) 10Jbond: apereo_cas: add cas_u2f script (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/739279 
(https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [17:58:50] (03PS2) 10Volans: netbox - cas: allow users with active=False [software/netbox] - 10https://gerrit.wikimedia.org/r/739309 (https://phabricator.wikimedia.org/T295148) [17:59:03] (03PS4) 10Ahmon Dancy: Added docker::resource_monitor and docker::gc classes [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) [17:59:35] (03CR) 10jerkins-bot: [V: 04-1] Added docker::resource_monitor and docker::gc classes [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy) [18:00:04] chrisalbon and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211116T1800). [18:00:30] 10SRE, 10SRE-tools, 10Analytics, 10Data-Engineering, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10BTullis) So at first glance, this looks like the Netbox script will do the right thing. It will delete and recreate the the cable, bu... 
[18:00:41] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:00:45] (03PS5) 10Jbond: apereo_cas: add cas_u2f script [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) [18:01:46] (03PS5) 10Ahmon Dancy: Added docker::resource_monitor and docker::gc classes [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) [18:02:04] (03PS4) 10Jbond: cookbook sre.idm.u2f: add cookbook to enable/disable u2f [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) [18:02:18] (03CR) 10jerkins-bot: [V: 04-1] Added docker::resource_monitor and docker::gc classes [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy) [18:02:37] (03CR) 10Jbond: [C: 03+1] netbox - cas: allow users with active=False [software/netbox] - 10https://gerrit.wikimedia.org/r/739309 (https://phabricator.wikimedia.org/T295148) (owner: 10Volans) [18:03:15] (03PS6) 10Ahmon Dancy: Added docker::resource_monitor and docker::gc classes [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) [18:03:39] (03PS1) 10MSantos: mobileapps: bumpt to 2021-11-16-154934-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/739311 [18:03:49] (03CR) 10jerkins-bot: [V: 04-1] Added docker::resource_monitor and docker::gc classes [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy) [18:04:29] (03PS7) 10Ahmon Dancy: Added docker::resource_monitor and docker::gc classes [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) [18:04:46] (03CR) 10jerkins-bot: [V: 04-1] cookbook sre.idm.u2f: add cookbook to enable/disable u2f 
[cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [18:04:57] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:05:01] (03CR) 10jerkins-bot: [V: 04-1] Added docker::resource_monitor and docker::gc classes [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy) [18:05:21] (03CR) 10Arturo Borrero Gonzalez: cloud: ceph: libvirt: migrate to new ceph auth abstraction (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739235 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [18:05:26] (03PS2) 10MSantos: mobileapps: bump to 2021-11-16-154934-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/739311 [18:06:59] (03CR) 10Volans: [C: 03+1] "LGMT, nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [18:08:10] 10SRE, 10SRE-tools, 10Analytics, 10Data-Engineering, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10Volans) @BTullis fwiw +1 from my end, thanks for having a look. 
[18:09:00] (03PS8) 10Ahmon Dancy: Added docker::resource_monitor and docker::gc classes [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) [18:11:40] (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2021-11-16-154934-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/739311 (owner: 10MSantos) [18:13:10] (03PS9) 10Ahmon Dancy: Added docker::resource_monitor and docker::gc classes [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) [18:16:38] (03Merged) 10jenkins-bot: mobileapps: bump to 2021-11-16-154934-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/739311 (owner: 10MSantos) [18:17:46] !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [18:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:29] (03PS6) 10Jbond: apereo_cas: add cas_u2f script [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) [18:19:34] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [18:21:09] I'm deploying mobileapps and started with the staging environment. After the deployment there are no pods running, any thoughts why this is happening? cc/ akosiaris and _joe_ [18:22:37] <_joe_> mbsantos: how did you determine there are no pods running? 
[18:22:57] helmfile -e staging status would give me the list of pods running [18:23:12] I'm assuming this would still be the case [18:23:15] <_joe_> mbsantos: oh yes, that's the transition to helm 3 [18:23:51] <_joe_> helm 3 is probably now enabled everywhere for staging, and indeed helm status in helm3 doesn't give you the same amount of info [18:23:53] ah that makes sense, I was afraid to continue with deployment because of that [18:23:54] <_joe_> so if you do [18:23:57] I did "kube_env mobileapps staging" and "kubectl get pods" on deploy1002 and can see pods, fwiw [18:24:05] <_joe_> that ^^ [18:24:16] <_joe_> thanks mutante I was typing exactly that :) [18:24:21] thanks mutante and _joe_ [18:24:25] glad it was right [18:24:30] <_joe_> but yes, this is a change that should be highlighted to ops@ [18:24:41] <_joe_> cc jelto ^^ [18:26:21] !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [18:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:37] !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[18:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:33] (03PS5) 10Dzahn: add miscweb to LVS [puppet] - 10https://gerrit.wikimedia.org/r/694625 (https://phabricator.wikimedia.org/T281538) [18:31:05] (03CR) 10Dzahn: add miscweb to LVS (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/694625 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [18:31:40] !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti6004.drmrs.wmnet with OS bullseye [18:31:40] !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti6002.drmrs.wmnet with OS bullseye [18:31:40] !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti6001.drmrs.wmnet with OS bullseye [18:31:40] !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti6003.drmrs.wmnet with OS bullseye [18:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:49] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host ganeti6002.drmrs.wmnet with OS bullseye [18:31:52] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host ganeti6001.drmrs.wmnet with OS bullseye [18:31:54] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host ganeti6004.drmrs.wmnet with OS bullseye [18:32:00] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host ganeti6003.drmrs.wmnet with OS bullseye [18:32:53] (03PS15) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 [18:33:21] (03CR) 10Jbond: hiera: create script endpoint for exporting hiera data (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 (owner: 10Jbond) [18:33:56] (03PS1) 10Jeena Huneidi: testwikis wikis to 1.38.0-wmf.9 refs T293950 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739316 [18:34:00] (03CR) 10Jeena Huneidi: [C: 03+2] testwikis wikis to 1.38.0-wmf.9 refs T293950 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739316 (owner: 10Jeena Huneidi) [18:34:56] (03Merged) 10jenkins-bot: testwikis wikis to 1.38.0-wmf.9 refs T293950 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739316 (owner: 10Jeena Huneidi) [18:34:59] !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.38.0-wmf.9 refs T293950 [18:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:02] T293950: 1.38.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T293950 [18:37:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:59] !log moving mgmt cables from old msw to new msw in a2-eqiad [18:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:18] (03PS1) 10Herron: role::elasticsearch::cloudelastic: ship ES logs via gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/739324 (https://phabricator.wikimedia.org/T288620) [18:48:05] (03PS1) 10Herron: role::elasticsearch::relforge: ship ES logs via gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/739325 (https://phabricator.wikimedia.org/T288620) [18:48:31] (03PS2) 10Herron: role::elasticsearch::relforge: ship ES logs via gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/739325 (https://phabricator.wikimedia.org/T288620) [18:49:03] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/739324 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [18:50:01] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:50:26] (03CR) 10Andrew Bogott: [C: 03+2] WMCS haproxy: set expose-fd listeners for all services [puppet] - 10https://gerrit.wikimedia.org/r/737986 (owner: 10Andrew Bogott) [18:51:21] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti6002.drmrs.wmnet with OS bullseye [18:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:30] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host ganeti6002.drmrs.wmnet with OS bullseye executed with errors: - gan... 
[18:53:16] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/739325 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [18:54:48] (03PS1) 10Andrew Bogott: mark_tool: Remove reliance of /etc/ldap.conf or /etc/ldap/ldap.conf [puppet] - 10https://gerrit.wikimedia.org/r/739326 (https://phabricator.wikimedia.org/T170355) [18:55:22] !log moving mgmt cables from old msw to new msw in a3-eqiad [18:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:35] (03CR) 10jerkins-bot: [V: 04-1] mark_tool: Remove reliance of /etc/ldap.conf or /etc/ldap/ldap.conf [puppet] - 10https://gerrit.wikimedia.org/r/739326 (https://phabricator.wikimedia.org/T170355) (owner: 10Andrew Bogott) [18:56:25] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti6003.drmrs.wmnet with OS bullseye [18:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:33] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host ganeti6003.drmrs.wmnet with OS bullseye executed with errors: - gan... 
[18:56:35] (03PS10) 10Ahmon Dancy: Added docker::gc class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) [18:57:04] PROBLEM - Host wcqs1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:58:00] PROBLEM - Host db1141.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:58:22] PROBLEM - Host mw1414.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:58:35] PROBLEM - Host cloudservices1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:58:47] (03PS2) 10Andrew Bogott: mark_tool: Remove reliance of /etc/ldap.conf or /etc/ldap/ldap.conf [puppet] - 10https://gerrit.wikimedia.org/r/739326 (https://phabricator.wikimedia.org/T170355) [18:59:24] PROBLEM - Host maps1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:59:40] PROBLEM - Host analytics1059.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:00:04] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211116T1900) [19:00:24] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:01:02] !log moving mgmt cables from old msw to new msw in a4-eqiad [19:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:08] RECOVERY - Host wcqs1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [19:02:26] (03CR) 10Andrew Bogott: [C: 03+2] mark_tool: Remove reliance of /etc/ldap.conf or /etc/ldap/ldap.conf [puppet] - 10https://gerrit.wikimedia.org/r/739326 (https://phabricator.wikimedia.org/T170355) (owner: 10Andrew Bogott) [19:03:18] RECOVERY - Host db1141.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [19:03:22] PROBLEM - Host ps1-a4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [19:03:34] PROBLEM - Host labstore1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:03:38] PROBLEM - Host contint1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:03:42] RECOVERY - Host mw1414.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.84 ms [19:03:46] PROBLEM - Host clouddb1014.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:03:48] PROBLEM - Host cloudelastic1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:03:52] PROBLEM - Host ganeti1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:03:52] RECOVERY - Host cloudservices1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [19:04:24] PROBLEM - Host ms-be1046.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:04:36] RECOVERY - Host maps1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.01 ms [19:04:38] PROBLEM - Host netmon1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:04:56] RECOVERY - Host analytics1059.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [19:05:32] RECOVERY - Host ps1-a4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.27 ms [19:06:44] !log moving mgmt cables from old msw to new msw in a5-eqiad [19:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:48] RECOVERY - Host netmon1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [19:08:08] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [19:08:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:06] RECOVERY - Host labstore1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.36 ms [19:09:12] RECOVERY - Host contint1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.14 ms [19:09:22] RECOVERY - Host clouddb1014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [19:09:24] RECOVERY - Host cloudelastic1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [19:09:30] RECOVERY - Host ganeti1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [19:10:02] RECOVERY - Host ms-be1046.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.31 ms [19:10:25] !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti6003.drmrs.wmnet with OS bullseye [19:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:38] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host ganeti6003.drmrs.wmnet with OS bullseye [19:11:03] !log moving mgmt cables from old msw to new msw in a7-eqiad [19:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:10] !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti6002.drmrs.wmnet with OS bullseye [19:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:14] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti6004.drmrs.wmnet with OS bullseye [19:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:19] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories 
for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host ganeti6002.drmrs.wmnet with OS bullseye [19:11:23] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host ganeti6004.drmrs.wmnet with OS bullseye completed: - ganeti6004 (**... [19:11:31] !log jhuneidi@deploy1002 Finished scap: testwikis wikis to 1.38.0-wmf.9 refs T293950 (duration: 36m 32s) [19:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:35] T293950: 1.38.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T293950 [19:13:24] PROBLEM - Host asw2-a-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [19:13:54] PROBLEM - MariaDB Replica Lag: s8 on db1171 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 813.51 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:14:11] (03PS11) 10Ahmon Dancy: Added docker::gc class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) [19:14:18] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti6001.drmrs.wmnet with OS bullseye [19:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:29] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host ganeti6001.drmrs.wmnet with OS bullseye completed: - ganeti6001 (**... 
[19:14:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:43] (03PS1) 10Andrew Bogott: disable_tool: add ldap uri to the config file [puppet] - 10https://gerrit.wikimedia.org/r/739331 (https://phabricator.wikimedia.org/T170355) [19:14:44] RECOVERY - Host asw2-a-eqiad is UP: PING OK - Packet loss = 0%, RTA = 11.38 ms [19:15:05] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: drmrs: primary software task - https://phabricator.wikimedia.org/T282788 (10Majavah) [19:15:08] !log jhuneidi@deploy1002 Pruned MediaWiki: 1.38.0-wmf.6 (duration: 03m 17s) [19:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:29] (03PS2) 10Herron: role::elasticsearch::cloudelastic: ship ES logs via gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/739324 (https://phabricator.wikimedia.org/T288620) [19:15:33] (03PS12) 10Ahmon Dancy: Added docker::gc class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) [19:16:04] !log joal@deploy1002 Started deploy [analytics/refinery@194b11b]: Regular analytics weekly train [analytics/refinery@194b11b] [19:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:14] PROBLEM - Host mw1448.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:16:35] (03CR) 10jerkins-bot: [V: 04-1] Added docker::gc class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy) [19:17:41] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/739324 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [19:18:17] !log moving mgmt cables from old msw to new msw in b1-eqiad [19:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:47] (03PS13) 10Ahmon Dancy: Added docker::gc class [puppet] - 
10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) [19:18:54] PROBLEM - Host asw2-a-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [19:19:33] asw? [19:19:42] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:20:35] majavah: if you're asking what does it mean, access switch [19:21:11] PROBLEM - Host kubestage1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:21:30] Amir1: I know, I'm just wondering why that went down since !logs are about msw's and that's a different row than what was just then being worked on [19:21:40] (03CR) 10Andrew Bogott: [C: 03+2] disable_tool: add ldap uri to the config file [puppet] - 10https://gerrit.wikimedia.org/r/739331 (https://phabricator.wikimedia.org/T170355) (owner: 10Andrew Bogott) [19:22:04] I mean it shouldn't even alert if it's down timed [19:22:08] 10SRE, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10Papaul) [19:22:09] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:22:30] it's a different row [19:22:35] RECOVERY - Host kubestage1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.22 ms [19:23:56] RECOVERY - Host asw2-a-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [19:23:57] (03CR) 10Awight: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/739334 (https://phabricator.wikimedia.org/T295781) (owner: 10Awight) [19:24:17] 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: (Need By: TBD) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10RobH) [19:24:31] PROBLEM - Host wcqs1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:24:43] (03CR) 10Brennen Bearnes: [C: 03+1] Added docker::gc class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy) [19:26:23] 10SRE, 10ops-codfw, 10DC-Ops, 10Kubernetes: Q2:(Need By: TBD) rack/setup/install kubernetes2018 - https://phabricator.wikimedia.org/T294299 (10Papaul) [19:27:55] !log moving mgmt cables from old msw to new msw in b2-eqiad [19:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:18] 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: TBD) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10RobH) p:05Medium→03High [19:29:29] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1759.69 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:29:49] RECOVERY - Host wcqs1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [19:30:47] PROBLEM - Host cloudcephmon1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:31:01] PROBLEM - Host clouddb1015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:31:07] PROBLEM - Host ms-be1058.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:31:13] 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10RobH) [19:31:16] 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10RobH) [19:33:46] (03CR) 10Cathal Mooney: "Looks good! 
I've had a good look through and stepped through the scenarios I could imagine, I think it should cover the use-case for drmr" [homer/public] - 10https://gerrit.wikimedia.org/r/738372 (https://phabricator.wikimedia.org/T283050) (owner: 10Ayounsi) [19:34:12] (03CR) 10Cathal Mooney: [C: 03+1] Add drmrs switches to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/738372 (https://phabricator.wikimedia.org/T283050) (owner: 10Ayounsi) [19:34:22] !log moving mgmt cables from old msw to new msw in b3-eqiad [19:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:42] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={LIST,PATCH} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:35:54] RECOVERY - Host cloudcephmon1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [19:36:04] RECOVERY - Host clouddb1015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.29 ms [19:36:10] RECOVERY - Host ms-be1058.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [19:36:22] PROBLEM - Host conf1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:36:26] PROBLEM - Host db1104.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:36:42] PROBLEM - Host mw1429.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:37:10] PROBLEM - Host mw1428.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:37:10] PROBLEM - Host mw1430.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:37:10] PROBLEM - Host mw1431.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:37:10] PROBLEM - Host mw1432.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:37:10] PROBLEM - Host mw1433.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:38:18] !log joal@deploy1002 Finished deploy [analytics/refinery@194b11b]: Regular analytics weekly train [analytics/refinery@194b11b] (duration: 22m 14s) [19:38:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:42] RECOVERY - Host mw1448.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [19:39:46] !log joal@deploy1002 Started deploy [analytics/refinery@194b11b] (thin): Regular analytics weekly train THIN [analytics/refinery@194b11b] [19:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:53] !log joal@deploy1002 Finished deploy [analytics/refinery@194b11b] (thin): Regular analytics weekly train THIN [analytics/refinery@194b11b] (duration: 00m 07s) [19:39:54] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:00] RECOVERY - Host mw1430.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [19:40:04] !log joal@deploy1002 Started deploy [analytics/refinery@194b11b] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@194b11b] [19:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:34] PROBLEM - Host moss-be1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:41:36] RECOVERY - Host conf1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [19:41:38] PROBLEM - Host copernicium.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:41:38] RECOVERY - Host db1104.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [19:41:54] RECOVERY - Host mw1429.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.05 ms [19:42:26] RECOVERY - Host mw1428.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [19:42:26] RECOVERY - Host mw1431.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [19:42:26] RECOVERY - Host mw1433.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [19:42:26] RECOVERY - Host mw1432.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [19:42:55] 
(LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [19:43:13] !log moving mgmt cables from old msw to new msw in b5-eqiad [19:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:36] RECOVERY - Host copernicium.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms [19:45:17] (03PS1) 10Ppchelko: Demo: load a config variable from JSON file in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739336 [19:45:22] PROBLEM - Host db1164.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:45:24] PROBLEM - Host mw1395.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:45:24] PROBLEM - Host mw1397.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:45:30] PROBLEM - Host db1179.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:45:44] PROBLEM - Host restbase1029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:45:50] PROBLEM - Host wdqs1012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:46:00] RECOVERY - Host moss-be1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.82 ms [19:46:02] (03PS3) 10Legoktm: httpbb: Add some tests for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/739025 (https://phabricator.wikimedia.org/T285477) [19:46:38] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={list,listWithCount} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [19:46:57] !log joal@deploy1002 Finished deploy [analytics/refinery@194b11b] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@194b11b] (duration: 06m 53s) [19:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:30] (03CR) 10Legoktm: [V: 03+1] "legoktm@cumin1001:~$ httpbb --hosts 
thumbor1001.eqiad.wmnet --http_port 8800 ~/test_thumbor.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/739025 (https://phabricator.wikimedia.org/T285477) (owner: 10Legoktm) [19:47:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [19:48:10] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [19:49:18] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:50:38] (03PS2) 10Ppchelko: Demo: load a config variable from JSON file in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739336 [19:51:25] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti6003.drmrs.wmnet with OS bullseye [19:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:35] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti6002.drmrs.wmnet with OS bullseye [19:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:40] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host ganeti6003.drmrs.wmnet with OS bullseye completed: - ganeti6003 (**... [19:51:43] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host ganeti6002.drmrs.wmnet with OS bullseye completed: - ganeti6002 (**... 
[19:52:06] RECOVERY - Host db1164.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.27 ms [19:52:06] RECOVERY - Host db1179.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [19:52:34] RECOVERY - Host mw1395.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [19:52:34] RECOVERY - Host mw1397.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.25 ms [19:52:36] RECOVERY - Host restbase1029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [19:52:36] RECOVERY - Host wdqs1012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms [19:52:45] !log moving mgmt cables from old msw to new msw in b7-eqiad [19:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:42] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms [19:55:10] PROBLEM - Host dbprov1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:55:20] PROBLEM - Host cloudcephmon1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:55:20] PROBLEM - Host cloudcephosd1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:55:34] PROBLEM - Host kafka-main1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:57:03] (03CR) 10AOkoth: [C: 03+1] Added docker::gc class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy) [19:58:52] PROBLEM - Host mw1401.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:58:52] PROBLEM - Host mw1399.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:58:52] PROBLEM - Host mw1402.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:59:22] RECOVERY - Host mw1399.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [20:00:04] jeena and dduvall: (Dis)respected human, time to deploy MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211116T2000). Please do the needful. 
[20:00:34] RECOVERY - Host dbprov1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [20:00:46] RECOVERY - Host cloudcephosd1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms [20:00:46] RECOVERY - Host cloudcephmon1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [20:01:00] RECOVERY - Host kafka-main1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [20:03:29] 10SRE, 10SRE-Access-Requests, 10Wikibase Release Strategy, 10Wikidata, 10wdwb-tech: Requesting access to releasers-wikibase for rosalie-WMDE - https://phabricator.wikimedia.org/T295765 (10WMDE-leszek) [20:03:46] 10SRE, 10SRE-Access-Requests, 10Wikibase Release Strategy, 10Wikidata, 10wdwb-tech: Requesting access to releasers-wikibase for rosalie-WMDE - https://phabricator.wikimedia.org/T295765 (10WMDE-leszek) Thanks @thcipriani. I conclude that WMF approval is not required then. [20:03:54] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: cloudcephmon1002, stat1005, cloudcephmon1003, cloudcephmon1001, stat1008 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [20:03:58] I didn't realize I had left this channel so I can't see the backscroll. I am going to deploy the train now. 
If there was anything that should hold it up please advise [20:04:24] RECOVERY - Host mw1401.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [20:04:24] RECOVERY - Host mw1402.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [20:04:35] jeena: there are some alerts but it's only maintenance on mgmt, seems clear for the train [20:04:51] thanks mutante [20:04:53] (as long as it stays .mgmt) [20:07:34] (03PS1) 10Jeena Huneidi: group0 wikis to 1.38.0-wmf.9 refs T293950 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739338 [20:07:36] (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.38.0-wmf.9 refs T293950 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739338 (owner: 10Jeena Huneidi) [20:08:17] (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.9 refs T293950 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739338 (owner: 10Jeena Huneidi) [20:09:26] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.9 refs T293950 [20:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:30] T293950: 1.38.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T293950 [20:10:22] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "[cumin1001:~] $ httpbb --hosts thumbor1001.eqiad.wmnet --http_port 8800 /home/legoktm/test_thumbor.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/739025 (https://phabricator.wikimedia.org/T285477) (owner: 10Legoktm) [20:11:16] (03PS3) 10Dzahn: mediawiki/parsoid/wikitech: flip default for font install [puppet] - 10https://gerrit.wikimedia.org/r/739012 (https://phabricator.wikimedia.org/T294378) [20:13:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:44] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:17:58] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [20:19:46] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/32452/" [puppet] - 10https://gerrit.wikimedia.org/r/739012 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [20:21:00] (03CR) 10Dzahn: "compiles noop everywhere, just switching the default value to "false" now and removing Hiera lines" [puppet] - 10https://gerrit.wikimedia.org/r/739012 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [20:28:08] (03PS2) 10Dzahn: wikimania_scholarships: let the module start to remove itself [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037) [20:30:21] (03CR) 10Volans: "Replies inline" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 (owner: 10Jbond) [20:33:04] (03CR) 10Dzahn: "ahaha, puppet duplicate declaration that is USEFUL - it tells us what else uses php-mysql here so we can't remove that. Duplicate declarat" [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [20:33:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:34] (03PS3) 10Dzahn: wikimania_scholarships: let the module start to remove itself [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037) [20:36:20] 10SRE, 10Wikimedia-Mailing-lists, 10observability: Improve mailman3 queue alerting - https://phabricator.wikimedia.org/T295805 (10Volans) As a quick fix you could tweak the `check_interval`, `max_check_attempts` and `retry_interval` Icinga parameters that are exposed in `nrpe::monitor_service` as `check_inte... [20:38:10] (03CR) 10Dzahn: "better, but removing the entire scap deploy service, is it used by other sites? https://puppet-compiler.wmflabs.org/compiler1003/32456/mi" [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [20:39:07] !log restarting blazegraph on wdqs1005 (jvm stuck) [20:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:13] (03CR) 10Dzahn: [C: 04-1] "absenting one specific scap::target on a server with multiple scap targets would not just remove one target but break them all because it " [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [20:42:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:57] (03PS4) 10Dzahn: wikimania_scholarships: let the module start to remove itself [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037) [20:45:20] (03CR) 10Dzahn: [C: 03+2] "removing (commenting out) but not absenting is the way to go for removing scap::targets https://puppet-compiler.wmflabs.org/compiler1002/32" [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [20:45:27] (03PS5) 10Dzahn: wikimania_scholarships: let the module start to remove itself [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037) [20:46:03] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/32457/" [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [20:49:55] (03CR) 10Dzahn: "Motd/File[/etc/update-motd.d/05-role-wikimania-scholarships]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [20:51:09] !log [miscweb2002:/var/cache] $ sudo rm -rf scholarships/ [20:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:27] 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10netops: ulsfo: (2) mx80s to become temp cr[34]-drmrs - https://phabricator.wikimedia.org/T295819 (10RobH) [20:56:05] (03PS1) 10Dzahn: httpbb/miscweb: drop tests for scholarships.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/739347 (https://phabricator.wikimedia.org/T243037) [20:56:34] (03CR) 10Dzahn: [C: 03+2] httpbb/miscweb: drop tests for scholarships.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/739347 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [20:58:35] (03PS2) 10Dzahn: httpbb/miscweb: drop tests for scholarships.wikimedia.org [puppet] - 
10https://gerrit.wikimedia.org/r/739347 (https://phabricator.wikimedia.org/T243037) [20:59:36] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:02:17] (03CR) 10Dzahn: "[cumin1001:~] $ httpbb /srv/deployment/httpbb-tests/miscweb/test_miscweb* --hosts miscweb1002.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/739347 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [21:06:37] (03PS1) 10Herron: mailman3_queue_size: increase check intervals [puppet] - 10https://gerrit.wikimedia.org/r/739351 (https://phabricator.wikimedia.org/T295805) [21:18:44] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:21:28] (03PS1) 10Dzahn: acme_chief: convert cron to restart service to timer [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) [21:23:09] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: convert cron to restart service to timer [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:24:05] (03PS2) 10Dzahn: acme_chief: convert cron to restart service to timer [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) [21:24:45] (03PS3) 10Dzahn: acme_chief: convert cron to restart service to timer [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) [21:24:51] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: convert cron to restart service to timer [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:27:42] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/32458/" [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) 
(owner: 10Dzahn) [21:32:30] RECOVERY - MariaDB Replica Lag: s8 on db1171 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:40:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10RobH) [21:43:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10RobH) [21:53:49] (03CR) 10Juan90264: [C: 03+1] Disable local file upload on the Chinese Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738550 (https://phabricator.wikimedia.org/T295265) (owner: 104nn1l2) [21:58:32] PROBLEM - DNS on mw1448.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.26 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:10:56] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/739325 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [22:11:17] (03CR) 10Cwhite: [C: 03+1] role::elasticsearch::cloudelastic: ship ES logs via gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/739324 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [22:14:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install ml-train100[1-4] - https://phabricator.wikimedia.org/T291579 (10Jclark-ctr) @elukey if you can update names when you get a chance Thanks Host Racked waiting on cabling in case something changes ml-train1001 a2... 
[22:15:51] 10SRE, 10Analytics, 10LDAP-Access-Requests: LDAP access to the wmf group for Brooke Camarda & Olga Spingou (superset, turnilo, hue) - https://phabricator.wikimedia.org/T295828 (10CGlenn) [22:19:29] (03CR) 10Legoktm: [V: 03+1 C: 03+2] "Shipping, thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/739025 (https://phabricator.wikimedia.org/T285477) (owner: 10Legoktm) [22:20:26] (03PS14) 10Brennen Bearnes: Added docker::gc class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy) [22:35:38] 10SRE, 10MediaWiki-Categories, 10Russian-Sites, 10Serbian-Sites: Broken sorting and multi-page categories for Cyrillic wikis - https://phabricator.wikimedia.org/T136281 (10FriedrickMILBarbarossa) [22:39:04] (03PS1) 10Legoktm: thumbor: Add thumbor1005 [puppet] - 10https://gerrit.wikimedia.org/r/739361 (https://phabricator.wikimedia.org/T285477) [22:39:06] (03PS1) 10Legoktm: conftool: Add thumbor1005 [puppet] - 10https://gerrit.wikimedia.org/r/739362 (https://phabricator.wikimedia.org/T285477) [22:42:01] (03CR) 10Legoktm: [C: 03+2] thumbor: Add thumbor1005 [puppet] - 10https://gerrit.wikimedia.org/r/739361 (https://phabricator.wikimedia.org/T285477) (owner: 10Legoktm) [22:42:33] (03CR) 10Dzahn: Added docker::gc class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy) [22:43:18] (03CR) 10Dzahn: Added docker::gc class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy) [22:43:22] (03CR) 10Ahmon Dancy: Added docker::gc class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy) [22:47:31] dancy: so since gitlab-runners project has its own puppetmaster, does it mean changes like this can be applied there before we merge in prod? 
[22:47:48] was about to compile that [22:47:59] Yes. I did do that with this commit and tested on gitlab-runner1008 [22:48:20] trying to find an instance though that the compiler already knows [22:48:25] and has facts for [22:48:35] alright, in that case.. I will just merge it :) [22:48:46] (03CR) 10Dzahn: [C: 03+2] Added docker::gc class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy) [22:48:50] Thanks! [22:49:58] done! do you know manually pull on the local master or it just happens? [22:50:06] now [22:50:25] It'll happen manually, but I can pull now to get it moving along [22:50:46] alright, cool [22:53:35] Looks like brennen has something in progress on the puppet master. Waiting for him [22:54:15] gitlab-runners-puppetmaster-01 has Hiera: puppetmaster: gitlab-runners-puppetmaster-01.gitlab-runners.eqiad1.wikimedia.cloud [22:54:36] but that does not seem to mean it gets confused about who is its own mater [22:55:20] ack, dancy, no rush [22:55:31] 👍🏾 [22:58:45] (03PS1) 10Dzahn: gitlab-runners: move profile::gitlab::runner::docker_volume: true to repo [puppet] - 10https://gerrit.wikimedia.org/r/739366 [23:00:29] (03PS1) 10Dzahn: gitlab-runners: move puppetmaster setting to repo [puppet] - 10https://gerrit.wikimedia.org/r/739367 [23:06:56] (03PS4) 10Ryan Kemper: elasticsearch: it's ExecStartPre, not ExecPreStart [puppet] - 10https://gerrit.wikimedia.org/r/721644 (https://phabricator.wikimedia.org/T276198) [23:13:27] (03PS1) 10Clare Ming: Add new icons, wordmarks, taglines for several wikis: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739370 (https://phabricator.wikimedia.org/T290091) [23:14:01] 10SRE, 10Traffic: Image requests sending neither "Last-Modified" nor "ETag" HTTP headers. - https://phabricator.wikimedia.org/T295556 (10Ade56facc) OK, I have seen again responses from server Thumbor without headers named in bug title. 
I have reloaded web page a few times using key F5 in Chrome browser (which... [23:14:17] (03CR) 10jerkins-bot: [V: 04-1] Add new icons, wordmarks, taglines for several wikis: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739370 (https://phabricator.wikimedia.org/T290091) (owner: 10Clare Ming) [23:17:31] (03CR) 10Ryan Kemper: [C: 03+2] elasticsearch: it's ExecStartPre, not ExecPreStart [puppet] - 10https://gerrit.wikimedia.org/r/721644 (https://phabricator.wikimedia.org/T276198) (owner: 10Ryan Kemper) [23:18:00] (03PS1) 10Dzahn: admin: add Julia Kieserman to ldap_only section [puppet] - 10https://gerrit.wikimedia.org/r/739371 (https://phabricator.wikimedia.org/T295693) [23:18:10] (03CR) 10Legoktm: [C: 03+2] conftool: Add thumbor1005 [puppet] - 10https://gerrit.wikimedia.org/r/739362 (https://phabricator.wikimedia.org/T285477) (owner: 10Legoktm) [23:18:31] ryankemper: OK to merge your change? [23:18:45] legoktm: fire away [23:19:01] {{done}} [23:19:05] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to LDAP/WMF for JKieserman - https://phabricator.wikimedia.org/T295693 (10Dzahn) Thank you, Julia! I uploaded a change to code review. This should continue from there shortly. 
Cheers, Daniel [23:19:05] ty [23:19:15] (03PS2) 10Clare Ming: Add new icons, wordmarks, taglines for several wikis: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739370 (https://phabricator.wikimedia.org/T290091) [23:19:15] !log T276198 `ryankemper@cumin1001:~$ sudo cumin '*elastic*' 'sudo disable-puppet "Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/721644"'` (done a few mins ago) [23:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:19] T276198: /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 [23:20:46] (03PS2) 10Dzahn: admin: add Julia Kieserman to ldap_only section [puppet] - 10https://gerrit.wikimedia.org/r/739371 (https://phabricator.wikimedia.org/T295693) [23:21:11] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=thumbor1005.eqiad.wmnet [23:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:13] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=thumbor1005.eqiad.wmnet [23:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:32] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=thumbor1005.eqiad.wmnet [23:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:06] I pooled it for a minute but depooled because it seemed to be returning 404s for everything [23:23:10] (03CR) 10Dzahn: [V: 03+1] "[mwmaint1002:~] $ ldapsearch -x uid=jkieserman" [puppet] - 10https://gerrit.wikimedia.org/r/739371 (https://phabricator.wikimedia.org/T295693) (owner: 10Dzahn) [23:25:44] checking the 404s they all seem legit [23:25:51] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=thumbor1005.eqiad.wmnet [23:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:17] !log legoktm@cumin1001 conftool action : set/weight=5; selector: 
name=thumbor1005.eqiad.wmnet [23:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:51] !log T276198 `ryankemper@elastic1049:~$ sudo run-puppet-agent --force`; `elasticsearch_6@production-search-eqiad.service ` didn't restart but it looks like there might be slightly wrong with the new `ExecPreStart` line => `Executable path is not absolute, ignoring: systemd-tmpfiles --create /usr/lib/tmpfiles.d/elasticsearch.conf` [23:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:54] T276198: /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 [23:37:57] (03PS1) 10Legoktm: Move thumbor1006 to thumbor::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/739374 [23:40:03] (03PS2) 10Legoktm: Move thumbor1006 to thumbor::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/739374 (https://phabricator.wikimedia.org/T285477) [23:42:11] (03CR) 10Legoktm: [C: 03+2] Move thumbor1006 to thumbor::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/739374 (https://phabricator.wikimedia.org/T285477) (owner: 10Legoktm) [23:42:53] (03CR) 10Eevans: [C: 03+1] cassandra: move cluster:user relation from 1:1 relation to a 1:many [puppet] - 10https://gerrit.wikimedia.org/r/738270 (https://phabricator.wikimedia.org/T235299) (owner: 10Hnowlan) [23:43:27] (03PS1) 10Ryan Kemper: elasticsearch: use absolute path of binary [puppet] - 10https://gerrit.wikimedia.org/r/739375 (https://phabricator.wikimedia.org/T276198) [23:43:45] (03PS2) 10Ryan Kemper: elasticsearch: use absolute path of binary [puppet] - 10https://gerrit.wikimedia.org/r/739375 (https://phabricator.wikimedia.org/T276198) [23:44:44] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: use absolute path of binary [puppet] - 10https://gerrit.wikimedia.org/r/739375 (https://phabricator.wikimedia.org/T276198) (owner: 10Ryan Kemper) [23:46:52] (03PS3) 10Ryan Kemper: elasticsearch: use absolute path of binary [puppet] - 
10https://gerrit.wikimedia.org/r/739375 (https://phabricator.wikimedia.org/T276198) [23:49:23] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/739375 (https://phabricator.wikimedia.org/T276198) (owner: 10Ryan Kemper) [23:55:53] (03PS4) 10Ryan Kemper: elasticsearch: use absolute path of binary [puppet] - 10https://gerrit.wikimedia.org/r/739375 (https://phabricator.wikimedia.org/T276198) [23:57:23] (03CR) 10Legoktm: [C: 03+1] elasticsearch: use absolute path of binary [puppet] - 10https://gerrit.wikimedia.org/r/739375 (https://phabricator.wikimedia.org/T276198) (owner: 10Ryan Kemper) [23:58:19] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [23:58:43] (03CR) 10Ryan Kemper: [C: 03+2] elasticsearch: use absolute path of binary [puppet] - 10https://gerrit.wikimedia.org/r/739375 (https://phabricator.wikimedia.org/T276198) (owner: 10Ryan Kemper) [23:59:38] !log T276198 `ryankemper@elastic1049:~$ sudo run-puppet-agent --force` to test out https://gerrit.wikimedia.org/r/c/operations/puppet/+/739375 [23:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:42] T276198: /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198