[00:00:05] <jouncebot>	 RoanKattouw and Urbanecm: Dear deployers, time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211116T0000).
[00:00:05] <jouncebot>	 nray and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:21] <nray>	 here o/
[00:01:13] <tgr>	 o/
[00:01:14] <urbanecm>	 Hey everyone
[00:01:17] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:01:33] <urbanecm>	 tgr: want to do yours? :)
[00:02:03] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:02:14] <wikibugs>	 (CR) Urbanecm: [C: +2] MobileWebUIActions tracks init event [extensions/WikimediaEvents] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/738399 (https://phabricator.wikimedia.org/T294738) (owner: Jdlrobson)
[00:02:20] <wikibugs>	 (CR) Urbanecm: [C: +2] We need some way to distinguish namespaces [extensions/WikimediaEvents] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/739004 (https://phabricator.wikimedia.org/T294738) (owner: Nray)
[00:02:32] <tgr>	 urbanecm: would you mind doing it? they just need the rebase
[00:02:44] <urbanecm>	 tgr: sure thing :)
[00:04:08] <urbanecm>	 Can't say whether the format of the config is correct though :)
[00:04:48] <wikibugs>	 (CR) Urbanecm: [C: +2] [beta] Disable GrowthExperiments Add Link on all but enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/739023 (owner: Gergő Tisza)
[00:05:11] <tgr>	 actually, the non-beta patch probably needs sync-file, since it will be read by the next branch, and it might not get automatically synced before then?
[00:05:19] <urbanecm>	 Yup
[00:05:31] <urbanecm>	 But should be no op for prod otherwise
[00:05:36] <wikibugs>	 (Merged) jenkins-bot: [beta] Disable GrowthExperiments Add Link on all but enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/739023 (owner: Gergő Tisza)
[00:05:40] <tgr>	 yeah, nothing using it right now
[00:06:04] <wikibugs>	 (CR) Urbanecm: [C: +2] labs: Setup GEHomepageNewAccountVariantsByPlatform [mediawiki-config] - https://gerrit.wikimedia.org/r/738998 (https://phabricator.wikimedia.org/T294737) (owner: Kosta Harlan)
[00:06:09] <wikibugs>	 (PS3) Urbanecm: labs: Setup GEHomepageNewAccountVariantsByPlatform [mediawiki-config] - https://gerrit.wikimedia.org/r/738998 (https://phabricator.wikimedia.org/T294737) (owner: Kosta Harlan)
[00:06:17] <wikibugs>	 (CR) Urbanecm: [C: +2] labs: Setup GEHomepageNewAccountVariantsByPlatform [mediawiki-config] - https://gerrit.wikimedia.org/r/738998 (https://phabricator.wikimedia.org/T294737) (owner: Kosta Harlan)
[00:06:59] <wikibugs>	 (Merged) jenkins-bot: MobileWebUIActions tracks init event [extensions/WikimediaEvents] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/738399 (https://phabricator.wikimedia.org/T294738) (owner: Jdlrobson)
[00:07:01] <wikibugs>	 (Merged) jenkins-bot: We need some way to distinguish namespaces [extensions/WikimediaEvents] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/739004 (https://phabricator.wikimedia.org/T294738) (owner: Nray)
[00:07:04] <wikibugs>	 (Merged) jenkins-bot: labs: Setup GEHomepageNewAccountVariantsByPlatform [mediawiki-config] - https://gerrit.wikimedia.org/r/738998 (https://phabricator.wikimedia.org/T294737) (owner: Kosta Harlan)
[00:07:37] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:07:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:07:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:25] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:11:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:11:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:55] <urbanecm>	 the wmf backports were quicker than expected
[00:13:43] <urbanecm>	 nray: pulled to mwdebug1001, can you test please?
[00:14:01] <nray>	 yes, thank you. are both of the patches on that server?
[00:14:07] <urbanecm>	 affirmative
[00:14:10] <nray>	 cool, checking
[00:17:18] <wikibugs>	 (PS3) Urbanecm: GrowthExperiments: Set up GEHomepageNewAccountVariantsByPlatform [mediawiki-config] - https://gerrit.wikimedia.org/r/738999 (https://phabricator.wikimedia.org/T294737) (owner: Kosta Harlan)
[00:17:22] <wikibugs>	 (CR) Urbanecm: [C: +2] GrowthExperiments: Set up GEHomepageNewAccountVariantsByPlatform [mediawiki-config] - https://gerrit.wikimedia.org/r/738999 (https://phabricator.wikimedia.org/T294737) (owner: Kosta Harlan)
[00:18:17] <wikibugs>	 (Merged) jenkins-bot: GrowthExperiments: Set up GEHomepageNewAccountVariantsByPlatform [mediawiki-config] - https://gerrit.wikimedia.org/r/738999 (https://phabricator.wikimedia.org/T294737) (owner: Kosta Harlan)
[00:19:40] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 50d9f2687cd11e6f838313a530c6bbd498d0b83e: GrowthExperiments: Set up GEHomepageNewAccountVariantsByPlatform (T294737) (duration: 00m 56s)
[00:19:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:44] <stashbot>	 T294737: Add an image: experiment - https://phabricator.wikimedia.org/T294737
[00:19:51] <urbanecm>	 tgr: your patches should be merged/synced :)
[00:19:59] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:19:59] <tgr>	 thanks urbanecm!
[00:20:25] <nray>	 @urbanecm you may proceed. I'll be monitoring our event logging graphs after you deploy to make sure we don't get extreme spikes
[00:20:28] <tgr>	 I'll test in beta once the corresponding extension patch is merged
[00:21:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:22:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:07] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:22:14] <urbanecm>	 nray: perfect, syncing
[00:23:44] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.7/extensions/WikimediaEvents/: 738399: 739004: WikimediaEvents backports (T294738) (duration: 00m 56s)
[00:23:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:48] <stashbot>	 T294738: Define and instrument bounce rate on talk pages - https://phabricator.wikimedia.org/T294738
[00:23:50] <urbanecm>	 nray: and live
[00:23:52] <urbanecm>	 anything else?
[00:24:02] <nray>	 That's it. Thank you!
[00:24:22] <urbanecm>	 any time!
[00:25:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:25:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:33] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:27:42] <wikibugs>	 (PS1) Legoktm: httpbb: Add some tests for thumbor [puppet] - https://gerrit.wikimedia.org/r/739025 (https://phabricator.wikimedia.org/T285477)
[00:28:52] <urbanecm>	 !log UTC late window done
[00:28:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:53] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:35:15] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:36:59] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:38:09] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:38:21] <icinga-wm>	 PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:39:29] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:09:41] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:16:07] <wikibugs>	 (PS1) Gergő Tisza: GrowthExperiments configuration fixes [mediawiki-config] - https://gerrit.wikimedia.org/r/739032
[01:19:07] <wikibugs>	 (CR) Urbanecm: [C: -1] "-labs looks good" [mediawiki-config] - https://gerrit.wikimedia.org/r/739032 (owner: Gergő Tisza)
[01:22:07] <wikibugs>	 (PS1) Ebernhardson: Add CirrusSearch Old GC Hell alerting [alerts] - https://gerrit.wikimedia.org/r/739034 (https://phabricator.wikimedia.org/T290604)
[01:23:50] <wikibugs>	 (CR) Gergő Tisza: GrowthExperiments configuration fixes (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/739032 (owner: Gergő Tisza)
[01:25:49] <tgr>	 ^ I have a late followup to the deploy window.
[01:30:49] <wikibugs>	 (PS2) Gergő Tisza: GrowthExperiments configuration fixes [mediawiki-config] - https://gerrit.wikimedia.org/r/739032 (https://phabricator.wikimedia.org/T294737)
[01:39:19] <icinga-wm>	 RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:06:55] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.38.0-wmf.9 [core] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/739043
[02:06:57] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.38.0-wmf.9 [core] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/739043 (owner: TrainBranchBot)
[02:07:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:07:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:10:09] <wikibugs>	 (CR) RLazarus: "Thanks for doing this!" [puppet] - https://gerrit.wikimedia.org/r/739025 (https://phabricator.wikimedia.org/T285477) (owner: Legoktm)
[02:10:37] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:10:39] <wikibugs>	 (CR) RLazarus: httpbb: Add some tests for thumbor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/739025 (https://phabricator.wikimedia.org/T285477) (owner: Legoktm)
[02:10:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:10:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:21:59] <wikibugs>	 (CR) jerkins-bot: [V: -1] Branch commit for wmf/1.38.0-wmf.9 [core] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/739043 (owner: TrainBranchBot)
[03:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211116T0300)
[04:32:37] <icinga-wm>	 PROBLEM - ElasticSearch shard size check - 9200 on logstash1035 is CRITICAL: CRITICAL - logstash-mediawiki-2021.11.14(383.6666666666667gb) https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed
[05:20:27] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:51:21] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: cloudcephmon1001, cloudcephmon1003, cloudcontrol1005, cloudcontrol1003, cloudcephmon1002, cloudcontrol1004 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[06:27:37] <wikibugs>	 (CR) Legoktm: httpbb: Add some tests for thumbor (2 comments) [puppet] - https://gerrit.wikimedia.org/r/739025 (https://phabricator.wikimedia.org/T285477) (owner: Legoktm)
[06:27:54] <wikibugs>	 (PS2) Legoktm: httpbb: Add some tests for thumbor [puppet] - https://gerrit.wikimedia.org/r/739025 (https://phabricator.wikimedia.org/T285477)
[06:28:02] <wikibugs>	 (CR) Legoktm: [C: -1] httpbb: Add some tests for thumbor [puppet] - https://gerrit.wikimedia.org/r/739025 (https://phabricator.wikimedia.org/T285477) (owner: Legoktm)
[06:34:15] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:34:45] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:10:28] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install ml-train100[1-4] - https://phabricator.wikimedia.org/T291579 (elukey) @Jclark-ctr Hi! Before proceeding with the nodes do you mind to ping me or my team first? We are thinking of changing name to reflect the fact...
[07:19:54] <wikibugs>	 (CR) Elukey: Configure stat servers to use /srv/spark-tmp as spark.local.dir (1 comment) [puppet] - https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346) (owner: Btullis)
[07:25:10] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6002.drmrs.wmnet with OS buster
[07:25:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:20] <wikibugs>	 SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6002.drmrs.wmnet with OS buster
[07:27:19] <icinga-wm>	 PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:27:35] <icinga-wm>	 PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:27:45] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:32:09] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/6 UP : OSPFv3: 4/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:35:27] <icinga-wm>	 RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:35:41] <icinga-wm>	 RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:35:51] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:36:13] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:36:46] <wikibugs>	 (CR) Elukey: "I think this is a great step in the right direction, thanks a lot for working on it! Left some comments :)" [puppet] - https://gerrit.wikimedia.org/r/736753 (https://phabricator.wikimedia.org/T292389) (owner: Majavah)
[07:52:13] <wikibugs>	 SRE, ops-drmrs: Degraded RAID on cp6002 - https://phabricator.wikimedia.org/T295747 (ops-monitoring-bot)
[08:02:16] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: cloudcephmon1003, cloudcephmon1001, cloudcontrol1005, cloudcephmon1002, cloudcontrol1003, cloudcontrol1004 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[08:04:16] <wikibugs>	 (PS2) Muehlenhoff: admin: Remove access for jmixter [puppet] - https://gerrit.wikimedia.org/r/737864 (owner: Jbond)
[08:04:33] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6002.drmrs.wmnet with OS buster
[08:04:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:41] <wikibugs>	 SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6002.drmrs.wmnet with OS buster completed: - cp6002 (**WARN**)...
[08:05:00] <wikibugs>	 (CR) jerkins-bot: [V: -1] admin: Remove access for jmixter [puppet] - https://gerrit.wikimedia.org/r/737864 (owner: Jbond)
[08:06:54] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:07:43] <wikibugs>	 (PS3) Muehlenhoff: admin: Remove access for jmixter [puppet] - https://gerrit.wikimedia.org/r/737864 (owner: Jbond)
[08:10:36] <wikibugs>	 (CR) Muehlenhoff: [C: +2] admin: Remove access for jmixter [puppet] - https://gerrit.wikimedia.org/r/737864 (owner: Jbond)
[08:13:08] <wikibugs>	 (PS5) Muehlenhoff: Switch eqiad labsldapconfig to the read-only replicas [puppet] - https://gerrit.wikimedia.org/r/525220 (https://phabricator.wikimedia.org/T46722)
[08:14:41] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6003.drmrs.wmnet with OS buster
[08:14:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:51] <wikibugs>	 SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6003.drmrs.wmnet with OS buster
[08:18:28] <wikibugs>	 (PS5) Ema: varnish: add varnishmtail-wrapper [puppet] - https://gerrit.wikimedia.org/r/738910 (https://phabricator.wikimedia.org/T293879)
[08:18:42] <wikibugs>	 (CR) Ema: varnish: add varnishmtail-wrapper (1 comment) [puppet] - https://gerrit.wikimedia.org/r/738910 (https://phabricator.wikimedia.org/T293879) (owner: Ema)
[08:24:51] <wikibugs>	 (PS4) Ideophagous: arywiki NS [mediawiki-config] - https://gerrit.wikimedia.org/r/738876 (https://phabricator.wikimedia.org/T291737)
[08:41:30] <wikibugs>	 (CR) Ayounsi: [C: +2] _get_junos_router_interfaces: ignore VCP interfaces [software/homer/deploy] - https://gerrit.wikimedia.org/r/738905 (owner: Ayounsi)
[08:41:49] <wikibugs>	 (CR) Ayounsi: [C: +2] test_interface_termination_names: add breakout cables support [software/netbox-extras] - https://gerrit.wikimedia.org/r/738913 (owner: Ayounsi)
[08:52:51] <wikibugs>	 (CR) Mbch331: Add missing termbox codes from Wikibase (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: Mbch331)
[08:52:54] <wikibugs>	 (PS3) Mbch331: Add missing termbox codes from Wikibase [mediawiki-config] - https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836)
[08:53:04] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add missing termbox codes from Wikibase [mediawiki-config] - https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: Mbch331)
[08:54:34] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6003.drmrs.wmnet with OS buster
[08:54:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:44] <wikibugs>	 SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6003.drmrs.wmnet with OS buster completed: - cp6003 (**WARN**)...
[09:05:29] <wikibugs>	 (CR) Kosta Harlan: [C: -1] GrowthExperiments configuration fixes (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/739032 (https://phabricator.wikimedia.org/T294737) (owner: Gergő Tisza)
[09:09:38] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6004.drmrs.wmnet with OS buster
[09:09:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:48] <wikibugs>	 SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6004.drmrs.wmnet with OS buster
[09:15:09] <wikibugs>	 SRE, Prod-Kubernetes, serviceops, Kubernetes: Helm chart dependencies no longer in requitements.yaml - https://phabricator.wikimedia.org/T295750 (JMeybohm)
[09:15:32] <wikibugs>	 SRE, Prod-Kubernetes, serviceops, Kubernetes: Helm chart dependencies no longer in requirements.yaml - https://phabricator.wikimedia.org/T295750 (JMeybohm)
[09:18:15] <wikibugs>	 (PS3) Gergő Tisza: GrowthExperiments configuration fixes [mediawiki-config] - https://gerrit.wikimedia.org/r/739032 (https://phabricator.wikimedia.org/T294737)
[09:19:29] <wikibugs>	 (PS1) Majavah: Check for start npm script [software/tools-webservice] - https://gerrit.wikimedia.org/r/739111
[09:19:48] <wikibugs>	 (CR) Gergő Tisza: GrowthExperiments configuration fixes (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/739032 (https://phabricator.wikimedia.org/T294737) (owner: Gergő Tisza)
[09:20:45] <wikibugs>	 (CR) jerkins-bot: [V: -1] Check for start npm script [software/tools-webservice] - https://gerrit.wikimedia.org/r/739111 (owner: Majavah)
[09:33:24] <wikibugs>	 (PS1) Jgiannelos: tile-pregeneration: Fix argument order for batching [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/739115
[09:38:47] <wikibugs>	 (CR) Ayounsi: [V: +2 C: +2] _get_junos_router_interfaces: ignore VCP interfaces [software/homer/deploy] - https://gerrit.wikimedia.org/r/738905 (owner: Ayounsi)
[09:38:54] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 58, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:39:13] <logmsgbot>	 !log ayounsi@deploy1002 Started deploy [homer/deploy@c570af3]: Homer CR738905
[09:39:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:39] <logmsgbot>	 !log ayounsi@deploy1002 Finished deploy [homer/deploy@c570af3]: Homer CR738905 (duration: 01m 25s)
[09:40:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:46] <wikibugs>	 (PS4) Btullis: Configure stat servers to use /srv/spark-tmp as spark.local.dir [puppet] - https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346)
[09:45:54] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] ceph::auth::keyring: Generate keyring_path if not passed [puppet] - https://gerrit.wikimedia.org/r/738908 (https://phabricator.wikimedia.org/T293752) (owner: David Caro)
[09:46:01] <wikibugs>	 (CR) Kosta Harlan: GrowthExperiments configuration fixes (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/739032 (https://phabricator.wikimedia.org/T294737) (owner: Gergő Tisza)
[09:46:54] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 60, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:46:55] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32428/console" [puppet] - https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346) (owner: Btullis)
[09:47:08] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] ceph::auth::keyring: allow passing the full client name [puppet] - https://gerrit.wikimedia.org/r/738903 (https://phabricator.wikimedia.org/T293752) (owner: David Caro)
[09:48:56] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6004.drmrs.wmnet with OS buster
[09:48:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:06] <wikibugs>	 SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6004.drmrs.wmnet with OS buster completed: - cp6004 (**WARN**)...
[09:50:44] <wikibugs>	 (PS2) David Caro: ceph::auth::keyring: allow passing the full client name [puppet] - https://gerrit.wikimedia.org/r/738903 (https://phabricator.wikimedia.org/T293752)
[09:50:50] <wikibugs>	 (PS2) David Caro: ceph::auth::keyring: Generate keyring_path if not passed [puppet] - https://gerrit.wikimedia.org/r/738908 (https://phabricator.wikimedia.org/T293752)
[09:51:02] <wikibugs>	 (CR) Vgutierrez: [C: +1] varnish: add varnishmtail-wrapper [puppet] - https://gerrit.wikimedia.org/r/738910 (https://phabricator.wikimedia.org/T293879) (owner: Ema)
[09:51:07] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6005.drmrs.wmnet with OS buster
[09:51:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:16] <wikibugs>	 SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6005.drmrs.wmnet with OS buster
[09:52:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:52:57] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32429/console" [puppet] - https://gerrit.wikimedia.org/r/738908 (https://phabricator.wikimedia.org/T293752) (owner: David Caro)
[09:53:30] <wikibugs>	 SRE, vm-requests, Patch-For-Review: eqiad: 2 VMs for cloudbackup-dev - https://phabricator.wikimedia.org/T295584 (aborrero) OK! We don't really care about the OS drive size. What's important here is the extra drive for LVM, which should have at least 20G.  You create the VMs or I do? I never did it b...
[09:54:05] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:54:08] <wikibugs>	 (PS5) Btullis: Configure stat servers to use /srv/spark-tmp as spark.local.dir [puppet] - https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346)
[09:55:21] <wikibugs>	 (PS1) Jgiannelos: tegola-vector-tiles: Disable debugging on codfw [deployment-charts] - https://gerrit.wikimedia.org/r/739118
[09:55:29] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32430/console" [puppet] - https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346) (owner: Btullis)
[09:57:27] <wikibugs>	 (CR) Btullis: "PCC looks better this time." [puppet] - https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346) (owner: Btullis)
[09:58:00] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: cloud: introduce role for cloudbackup-dev [puppet] - https://gerrit.wikimedia.org/r/738376 (https://phabricator.wikimedia.org/T295584)
[09:58:50] <wikibugs>	 (CR) Elukey: [C: +1] Configure stat servers to use /srv/spark-tmp as spark.local.dir [puppet] - https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346) (owner: Btullis)
[10:02:10] <wikibugs>	 (CR) Btullis: [C: +1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/738270 (https://phabricator.wikimedia.org/T235299) (owner: Hnowlan)
[10:02:28] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] Configure stat servers to use /srv/spark-tmp as spark.local.dir [puppet] - https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346) (owner: Btullis)
[10:02:30] <ema>	 !log A:cp disable puppet to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/738910 on cp4021 T293879
[10:02:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:34] <stashbot>	 T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough  - https://phabricator.wikimedia.org/T293879
[10:02:57] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] wmcs-srpeadcheck-tools: add new shorter webgrid names [puppet] - https://gerrit.wikimedia.org/r/731113 (https://phabricator.wikimedia.org/T292465) (owner: David Caro)
[10:03:25] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] aptrepo: add k8s 1.21 to stretch too [puppet] - https://gerrit.wikimedia.org/r/738912 (https://phabricator.wikimedia.org/T282942) (owner: Majavah)
[10:04:52] <wikibugs>	 (CR) Ema: [C: +2] varnish: add varnishmtail-wrapper [puppet] - https://gerrit.wikimedia.org/r/738910 (https://phabricator.wikimedia.org/T293879) (owner: Ema)
[10:06:18] <arturo>	 !log updating deb packages on stretch-wikimedia/thirdparty/kubeadm-k8s-1-21 (T282942)
[10:06:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:23] <stashbot>	 T282942: Upgrade Toolforge Kubernetes to latest 1.21 - https://phabricator.wikimedia.org/T282942
[10:07:55] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] WMCS haproxy: set expose-fd listeners for all services [puppet] - 10https://gerrit.wikimedia.org/r/737986 (owner: 10Andrew Bogott)
[10:08:01] <wikibugs>	 (03PS2) 10JMeybohm: Fix helm3 lint errors and helm dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/738980
[10:08:03] <wikibugs>	 (03PS3) 10JMeybohm: Run helmfile commands against the local version of the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857
[10:08:05] <wikibugs>	 (03PS1) 10JMeybohm: Auto add helm chart repositories [deployment-charts] - 10https://gerrit.wikimedia.org/r/739122
[10:08:40] <wikibugs>	 (03PS1) 10Jgiannelos: tile-pregeneration: Make script less verbose [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/739123
[10:09:00] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ayounsi) @cmooney, I agree with your take on the security aspect.  We're not in a typical service provider (ISP)/customer relations...
[10:10:33] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge::cronrunner: disable cron on non-active hosts [puppet] - 10https://gerrit.wikimedia.org/r/732986 (https://phabricator.wikimedia.org/T284767) (owner: 10Majavah)
[10:14:11] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2] Run helmfile commands against the local version of the chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 (owner: 10JMeybohm)
[10:14:16] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:15:10] <moritzm>	 !log installing testvm2001
[10:15:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:27] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-2] "I'm still not sure why this chain ends up producing a 100% diff for echostore, sessionstore and toolhub" [deployment-charts] - 10https://gerrit.wikimedia.org/r/739122 (owner: 10JMeybohm)
[10:20:00] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons.
[10:20:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:08] <ema>	 !log A:cp re-enable puppet after successful test on cp402[17] T293879
[10:21:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:11] <stashbot>	 T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough  - https://phabricator.wikimedia.org/T293879
[10:24:42] <wikibugs>	 (03CR) 10David Caro: [V: 03+1 C: 03+2] "All PCC changes were expected (only parameters, no actual resources)" [puppet] - 10https://gerrit.wikimedia.org/r/738908 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro)
[10:24:48] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] ceph::auth::keyring: allow passing the full client name [puppet] - 10https://gerrit.wikimedia.org/r/738903 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro)
[10:25:36] <wikibugs>	 (03PS1) 10Kormat: db1112: Re-enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/739126 (https://phabricator.wikimedia.org/T294295)
[10:26:11] <wikibugs>	 10SRE, 10vm-requests, 10Patch-For-Review: eqiad: 2 VMs for cloudbackup-dev - https://phabricator.wikimedia.org/T295584 (10MoritzMuehlenhoff) >>! In T295584#7505985, @aborrero wrote: > OK! We don't really care about the OS drive size. What's important here is the extra drive for LVM, which should have at leas...
[10:26:33] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] db1112: Re-enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/739126 (https://phabricator.wikimedia.org/T294295) (owner: 10Kormat)
[10:29:07] <wikibugs>	 10SRE, 10ops-drmrs, 10Traffic: Degraded RAID on cp6002 - https://phabricator.wikimedia.org/T295747 (10Peachey88)
[10:30:59] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6005.drmrs.wmnet with OS buster
[10:31:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:09] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6005.drmrs.wmnet with OS buster completed: - cp6005 (**WARN**)...
[10:35:52] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10aborrero) From a certain point of view what we're doing here is validating [[https://wikitech.wikimedia.org/wiki/Cross-Realm_traffi...
[10:39:42] <wikibugs>	 (03PS1) 10David Caro: ceph::auth: require load_all when checking keyrings [puppet] - 10https://gerrit.wikimedia.org/r/739127 (https://phabricator.wikimedia.org/T293752)
[10:40:38] <icinga-wm>	 PROBLEM - RPKI Validator RTR port on rpki2001 is CRITICAL: connect to address 10.192.0.103 and port 3323: Connection refused https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port
[10:41:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] ceph::auth: require load_all when checking keyrings [puppet] - 10https://gerrit.wikimedia.org/r/739127 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro)
[10:42:56] <icinga-wm>	 PROBLEM - Routinator process on rpki2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process
[10:43:00] <wikibugs>	 (03PS2) 10David Caro: ceph::auth: require load_all when checking keyrings [puppet] - 10https://gerrit.wikimedia.org/r/739127 (https://phabricator.wikimedia.org/T293752)
[10:43:11] <wikibugs>	 (03CR) 10Majavah: P::kerberos: automate principal management (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/736753 (https://phabricator.wikimedia.org/T292389) (owner: 10Majavah)
[10:45:44] <wikibugs>	 (03PS1) 10Hnowlan: api-gateway: disable debug logging outside of staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/739130 (https://phabricator.wikimedia.org/T295717)
[10:48:55] <wikibugs>	 (03PS3) 10David Caro: p:{osd,backup_glance_images,backy2}: require ceph::auth::deploy when checking keyrings [puppet] - 10https://gerrit.wikimedia.org/r/739127 (https://phabricator.wikimedia.org/T293752)
[10:50:33] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] p:{osd,backup_glance_images,backy2}: require ceph::auth::deploy when checking keyrings [puppet] - 10https://gerrit.wikimedia.org/r/739127 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro)
[10:51:41] <wikibugs>	 (03CR) 10Muehlenhoff: "Didn't find the time yet to read through it at large yet, but one comment line." [puppet] - 10https://gerrit.wikimedia.org/r/736753 (https://phabricator.wikimedia.org/T292389) (owner: 10Majavah)
[10:52:41] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: allow mtail to match all handlers [puppet] - 10https://gerrit.wikimedia.org/r/738918 (owner: 10Giuseppe Lavagetto)
[10:56:16] <icinga-wm>	 RECOVERY - RPKI Validator RTR port on rpki2001 is OK: TCP OK - 0.032 second response time on 10.192.0.103 port 3323 https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port
[10:59:36] <wikibugs>	 (03PS4) 10David Caro: p:{osd,b_g_images,backy2}: require c::a::deploy when checking keyrings [puppet] - 10https://gerrit.wikimedia.org/r/739127 (https://phabricator.wikimedia.org/T293752)
[10:59:36] <icinga-wm>	 PROBLEM - RPKI Validator RTR port on rpki2001 is CRITICAL: connect to address 10.192.0.103 and port 3323: Connection refused https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port
[11:01:29] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "There is an error in the monitoring check." [puppet] - 10https://gerrit.wikimedia.org/r/694625 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn)
[11:01:33] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 3 NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32433/console" [puppet] - 10https://gerrit.wikimedia.org/r/739127 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro)
[11:01:42] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] service/miscweb: switch state from service_setup to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/694628 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn)
[11:03:53] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop analytics cluster: Restart of jvm daemons.
[11:03:55] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[11:03:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:55] <wikibugs>	 (03Abandoned) 10Awight: Remove deprecated QuickSurveys config fields [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604895 (owner: 10Awight)
[11:07:01] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Fix helm3 lint errors and helm dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/738980 (owner: 10JMeybohm)
[11:07:33] <wikibugs>	 (03CR) 10Awight: "I recommend we use "layout" everywhere and make it a mandatory field." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604681 (https://phabricator.wikimedia.org/T255130) (owner: 10Awight)
[11:07:37] <wikibugs>	 (03CR) 10David Caro: [V: 03+1 C: 03+2] "PCC looks good" [puppet] - 10https://gerrit.wikimedia.org/r/739127 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro)
[11:07:56] <wikibugs>	 (03CR) 10JMeybohm: Auto add helm chart repositories (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/739122 (owner: 10JMeybohm)
[11:08:55] <jinxer-wm>	 (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[11:09:03] <wikibugs>	 10SRE, 10Scap, 10Release-Engineering-Team (Seen): find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470 (10hnowlan) This happens because of how DEPLOY_HEAD retains the last-used deploy server name and unless explicitly told to ignore, it w...
[11:10:41] <wikibugs>	 (03CR) 10Awight: "I learned that this class is adapted from https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/includes/SiteC" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737858 (owner: 10Thiemo Kreuz (WMDE))
[11:11:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Update approver for gitlab-roots/vrts-roots [puppet] - 10https://gerrit.wikimedia.org/r/738836 (owner: 10Muehlenhoff)
[11:12:09] <wikibugs>	 (03Merged) 10jenkins-bot: Fix helm3 lint errors and helm dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/738980 (owner: 10JMeybohm)
[11:13:47] <wikibugs>	 (03PS10) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914
[11:14:33] <wikibugs>	 (03PS2) 10JMeybohm: Auto add helm chart repositories [deployment-charts] - 10https://gerrit.wikimedia.org/r/739122
[11:20:27] <wikibugs>	 10SRE, 10vm-requests, 10Patch-For-Review, 10cloud-services-team (Kanban): eqiad: 2 VMs for cloudbackup-dev - https://phabricator.wikimedia.org/T295584 (10aborrero) p:05Triage→03Medium a:03aborrero >>! In T295584#7506073, @MoritzMuehlenhoff wrote: >>>! In T295584#7505985, @aborrero wrote: >> OK! We do...
[11:26:16] <icinga-wm>	 RECOVERY - Routinator process on rpki2001 is OK: PROCS OK: 1 process with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process
[11:31:39] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons.
[11:31:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:24] <icinga-wm>	 PROBLEM - Routinator process on rpki2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process
[11:33:46] <wikibugs>	 (03CR) 10Majavah: "recheck" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/739111 (owner: 10Majavah)
[11:34:04] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6006.drmrs.wmnet with OS buster
[11:34:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:13] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6006.drmrs.wmnet with OS buster
[11:36:03] <wikibugs>	 (03PS1) 10Jbond: P:netbox::scripts: use role_hosts to get ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/739139
[11:36:46] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32434/console" [puppet] - 10https://gerrit.wikimedia.org/r/739139 (owner: 10Jbond)
[11:40:35] <wikibugs>	 (03PS3) 10Majavah: Check for start npm script [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/739111
[11:40:50] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "Autonyms look okay now, but the commonswiki part is still missing. (Also, needs a rebase apparently. I guess some more language codes were" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: 10Mbch331)
[11:45:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[11:46:13] <wikibugs>	 (03PS1) 10Hashar: ci: upgrade lintian-junit-report [puppet] - 10https://gerrit.wikimedia.org/r/739141 (https://phabricator.wikimedia.org/T295719)
[11:46:56] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: introduce role for cloudbackup-dev [puppet] - 10https://gerrit.wikimedia.org/r/738376 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez)
[11:47:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] ci: upgrade lintian-junit-report [puppet] - 10https://gerrit.wikimedia.org/r/739141 (https://phabricator.wikimedia.org/T295719) (owner: 10Hashar)
[11:47:51] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (10cmooney) a:03cmooney So of course there is a complication.  Currently we have a single BGP session between adjacent CR routers, peered over the loopback IPv4 addresses either si...
[11:49:50] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:netbox::scripts: use role_hosts to get ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/739139 (owner: 10Jbond)
[11:50:06] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "Tested on netbox-next on all devices." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/738274 (owner: 10Volans)
[11:50:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[11:50:59] <wikibugs>	 (03Merged) 10jenkins-bot: scripts: clean temporary code from PuppetDB import [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/738274 (owner: 10Volans)
[11:51:06] <wikibugs>	 (03PS2) 10Hashar: ci: upgrade lintian-junit-report [puppet] - 10https://gerrit.wikimedia.org/r/739141 (https://phabricator.wikimedia.org/T295719)
[11:52:54] <wikibugs>	 (03PS11) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914
[11:54:32] <wikibugs>	 (03PS3) 10Hashar: ci: upgrade lintian-junit-report [puppet] - 10https://gerrit.wikimedia.org/r/739141 (https://phabricator.wikimedia.org/T295719)
[11:55:30] <moritzm>	 !log failover ganeti master in test cluster to ganeti-test2002
[11:55:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:11] <icinga-wm>	 RECOVERY - RPKI Validator RTR port on rpki2001 is OK: TCP OK - 0.034 second response time on 10.192.0.103 port 3323 https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port
[11:59:17] <icinga-wm>	 PROBLEM - RPKI Validator RTR port on rpki2001 is CRITICAL: connect to address 10.192.0.103 and port 3323: Connection refused https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port
[12:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211116T1200).
[12:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[12:00:20] <Lucas_WMDE>	 that’s good, because I’m off for lunch in a moment ^^
[12:00:42] <Lucas_WMDE>	 (there’s a Wikibase-related config change in the pipeline but it needs a bit more work anyways)
[12:02:21] <wikibugs>	 (03PS4) 10Mbch331: Add missing termbox codes from Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836)
[12:03:38] * urbanecm waves anyway, in case a deployer's needed
[12:05:46] <wikibugs>	 (03PS1) 10Volans: scripts: allow to remove iface from VMs too [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/739178
[12:06:13] <wikibugs>	 (03CR) 10Mbch331: "Now commons and Wikidata should be in sync and I've rebased the code." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: 10Mbch331)
[12:13:32] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6006.drmrs.wmnet with OS buster
[12:13:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:41] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6006.drmrs.wmnet with OS buster completed: - cp6006 (**WARN**)...
[12:14:59] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] "lgtm," [alerts] - 10https://gerrit.wikimedia.org/r/739034 (https://phabricator.wikimedia.org/T290604) (owner: 10Ebernhardson)
[12:17:11] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti-test2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[12:17:16] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Reorganize backups so we move s1 and s2 into dbprovX001 [puppet] - 10https://gerrit.wikimedia.org/r/739217 (https://phabricator.wikimedia.org/T280979)
[12:19:47] <wikibugs>	 (03PS3) 10Ema: varnish: move internal mtail scripts to another instance [puppet] - 10https://gerrit.wikimedia.org/r/738949 (https://phabricator.wikimedia.org/T293879)
[12:20:23] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I agree with the idea, but implementation needs a bit of work." [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 (owner: 10JMeybohm)
[12:20:47] <wikibugs>	 (03CR) 10Hashar: "Cherry picked on integration-puppetmaster02 and confirmed to work." [puppet] - 10https://gerrit.wikimedia.org/r/739141 (https://phabricator.wikimedia.org/T295719) (owner: 10Hashar)
[12:21:28] <wikibugs>	 (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/738949 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema)
[12:22:25] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10netops: Netbox - PuppetDB audit 2021-11 - https://phabricator.wikimedia.org/T295762 (10Volans) 05Open→03In progress p:05Triage→03Medium
[12:22:51] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid jvm daemons.
[12:22:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:08] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[12:24:08] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[12:24:30] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6007.drmrs.wmnet with OS buster
[12:24:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:40] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6007.drmrs.wmnet with OS buster
[12:25:42] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-1] GrowthExperiments configuration fixes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739032 (https://phabricator.wikimedia.org/T294737) (owner: 10Gergő Tisza)
[12:26:11] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: nova: factorize libvirt secrets management [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752)
[12:26:35] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 (10MoritzMuehlenhoff) 05In progress→03Resolved The new Ganeti test cluster has been setup: It consists of three nodes in row A of codfw (ganeti-test200[1-3].codfw.wmnet). A test...
[12:26:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] openstack: nova: factorize libvirt secrets management [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez)
[12:29:41] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-1] "+, wgGENewcomerTasksLinkRecommendationsEnabled is no longer prefixed with a hyphen? Is that intentional?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739032 (https://phabricator.wikimedia.org/T294737) (owner: 10Gergő Tisza)
[12:29:49] <moritzm>	 !log installing Linux 4.19.208 updates on buster hosts (no reboots)
[12:29:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:07] <wikibugs>	 10SRE-tools, 10Analytics, 10Infrastructure-Foundations, 10netops: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10Volans)
[12:31:21] <wikibugs>	 10SRE-tools, 10Analytics, 10Infrastructure-Foundations, 10netops: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10Volans) p:05Triage→03Medium
[12:31:23] <wikibugs>	 (03PS1) 10Urbanecm: [beta] Set wgGENewcomerTasksLinkRecommendationsEnabled to false everywhere but enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739225
[12:33:20] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] [beta] Set wgGENewcomerTasksLinkRecommendationsEnabled to false everywhere but enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739225 (owner: 10Urbanecm)
[12:33:56] <wikibugs>	 (03PS1) 10Ema: prometheus:ops: add varnishmtail-internal jobs [puppet] - 10https://gerrit.wikimedia.org/r/739227 (https://phabricator.wikimedia.org/T293879)
[12:34:02] <wikibugs>	 (03Merged) 10jenkins-bot: [beta] Set wgGENewcomerTasksLinkRecommendationsEnabled to false everywhere but enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739225 (owner: 10Urbanecm)
[12:36:50] <wikibugs>	 (03PS4) 10Ema: varnish: move internal mtail scripts to another instance [puppet] - 10https://gerrit.wikimedia.org/r/738949 (https://phabricator.wikimedia.org/T293879)
[12:36:52] <wikibugs>	 (03PS2) 10Ema: prometheus:ops: add varnishmtail-internal jobs [puppet] - 10https://gerrit.wikimedia.org/r/739227 (https://phabricator.wikimedia.org/T293879)
[12:37:36] <wikibugs>	 (03PS12) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914
[12:38:25] <wikibugs>	 (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/738949 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema)
[12:38:36] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] Remove PHP 7.3 images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/739006 (owner: 10Legoktm)
[12:41:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[12:41:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[12:45:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:30] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[12:54:03] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SCherukuwada - https://phabricator.wikimedia.org/T295552 (10Jelto)
[12:54:11] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to deployment for SCherukuwada - https://phabricator.wikimedia.org/T295550 (10Jelto)
[12:54:18] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[12:57:29] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] dbbackups: Reorganize backups so we move s1 and s2 into dbprovX001 [puppet] - 10https://gerrit.wikimedia.org/r/739217 (https://phabricator.wikimedia.org/T280979) (owner: 10Jcrespo)
[12:57:37] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: ceph: auth: introduce datatype for configuration hash [puppet] - 10https://gerrit.wikimedia.org/r/739228 (https://phabricator.wikimedia.org/T293752)
[12:58:05] <wikibugs>	 (03PS5) 10Ema: varnish: move internal mtail scripts to another instance [puppet] - 10https://gerrit.wikimedia.org/r/738949 (https://phabricator.wikimedia.org/T293879)
[12:58:07] <wikibugs>	 (03PS1) 10Ema: varnish: remove internal mtail scripts from default instance [puppet] - 10https://gerrit.wikimedia.org/r/739229 (https://phabricator.wikimedia.org/T293879)
[12:58:56] <wikibugs>	 10SRE-Access-Requests: Requesting access to releasers-wikibase for Rosalie_WMDE - https://phabricator.wikimedia.org/T295765 (10WMDE-leszek)
[12:59:54] <wikibugs>	 10SRE-Access-Requests: Requesting access to releasers-wikibase for Rosalie_WMDE - https://phabricator.wikimedia.org/T295765 (10WMDE-leszek) Looking at the previous request of this kind (T269777) I am unclear who should be requested to approve on WMF's end? @thcipriani ? Please advise, thank you.  I approve this...
[13:03:18] <wikibugs>	 10SRE-Access-Requests, 10Wikibase Release Strategy: Requesting access to releasers-wikibase for Rosalie_WMDE - https://phabricator.wikimedia.org/T295765 (10WMDE-leszek)
[13:03:56] <icinga-wm>	 PROBLEM - Check systemd state on ping2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:04:33] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6007.drmrs.wmnet with OS buster
[13:04:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:42] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6007.drmrs.wmnet with OS buster completed: - cp6007 (**WARN**)...
[13:04:50] <icinga-wm>	 PROBLEM - Check systemd state on ping1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:05:03] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: nova: factorize libvirt secrets management [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752)
[13:05:38] <moritzm>	 !log installing psmisc bugfix updates on buster hosts
[13:05:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] openstack: nova: factorize libvirt secrets management [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez)
[13:08:11] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.11 point update - https://phabricator.wikimedia.org/T292838 (10MoritzMuehlenhoff)
[13:14:06] <icinga-wm>	 PROBLEM - Disk space on ping2001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=66%): /tmp 0 MB (0% inode=66%): /var/tmp 0 MB (0% inode=66%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ping2001&var-datasource=codfw+prometheus/ops
[13:15:21] <wikibugs>	 (03PS2) 10Hnowlan: partmon: add reuse partmon profile for cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/738924 (https://phabricator.wikimedia.org/T295375)
[13:18:30] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767 (10MoritzMuehlenhoff)
[13:18:33] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: openstack: nova: factorize libvirt secrets management [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752)
[13:18:43] <moritzm>	 !log prune unused packages from ping1001/ping2001 T295767
[13:18:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:46] <stashbot>	 T295767: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767
[13:19:40] <icinga-wm>	 PROBLEM - Disk space on ping3001 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=66%): /tmp 1 MB (0% inode=66%): /var/tmp 1 MB (0% inode=66%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ping3001&var-datasource=esams+prometheus/ops
[13:19:48] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] ceph: auth: introduce datatype for configuration hash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739228 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez)
[13:20:28] <icinga-wm>	 RECOVERY - Check systemd state on ping1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:21:42] <moritzm>	 !log prune unused packages from ping3001 T295767
[13:21:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:47] <moritzm>	 !log installing debconf bugfix updates on buster
[13:23:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:50] <icinga-wm>	 RECOVERY - Routinator process on rpki2001 is OK: PROCS OK: 1 process with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process
[13:29:12] <icinga-wm>	 PROBLEM - Routinator process on rpki2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process
[13:30:57] <wikibugs>	 (03PS1) 10Jbond: (WIP) initial cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234
[13:32:13] <wikibugs>	 10SRE-Access-Requests, 10Wikibase Release Strategy: Requesting access to releasers-wikibase for rosalie-WMDE - https://phabricator.wikimedia.org/T295765 (10WMDE-leszek)
[13:32:38] <majavah>	 XioNoX: topranks: fyi, routinator checks on rpki2001 seem to be flapping ^^
[13:32:47] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.11 point update - https://phabricator.wikimedia.org/T292838 (10MoritzMuehlenhoff)
[13:33:02] <topranks>	 majavah:  thanks, looking now
[13:33:21] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Add missing termbox codes from Wikibase (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: 10Mbch331)
[13:34:02] <icinga-wm>	 RECOVERY - RPKI Validator RTR port on rpki2001 is OK: TCP OK - 0.032 second response time on 10.192.0.103 port 3323 https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port
[13:34:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] (WIP) initial cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (owner: 10Jbond)
[13:34:16] <icinga-wm>	 RECOVERY - Routinator process on rpki2001 is OK: PROCS OK: 1 process with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process
[13:34:24] <icinga-wm>	 RECOVERY - Disk space on ping2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ping2001&var-datasource=codfw+prometheus/ops
[13:34:49] <topranks>	 "No space left on device"
[13:35:06] <topranks>	 Ironically I was going to rebuild it later today to add more space.
[13:35:20] <moritzm>	 you jinxed it :-)
[13:35:27] <topranks>	 lol yep!
[13:35:43] <wikibugs>	 (03PS4) 10JMeybohm: Run helmfile commands against the local version of the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857
[13:35:45] <wikibugs>	 (03PS3) 10JMeybohm: Auto add helm chart repositories [deployment-charts] - 10https://gerrit.wikimedia.org/r/739122
[13:35:53] <topranks>	 Anyway I'll ack the alert for now and then do just that, no point faffing about trying to free space on the existing one.
[13:35:56] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.11 point update - https://phabricator.wikimedia.org/T292838 (10MoritzMuehlenhoff)
[13:35:56] <wikibugs>	 (03CR) 10JMeybohm: Run helmfile commands against the local version of the chart (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 (owner: 10JMeybohm)
[13:36:02] <topranks>	 majavah: thanks for the heads up :)
[13:36:09] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Auto add helm chart repositories [deployment-charts] - 10https://gerrit.wikimedia.org/r/739122 (owner: 10JMeybohm)
[13:36:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Run helmfile commands against the local version of the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 (owner: 10JMeybohm)
[13:36:16] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.1 point update - https://phabricator.wikimedia.org/T292844 (10MoritzMuehlenhoff)
[13:36:18] <icinga-wm>	 PROBLEM - Host cp2027 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:02] <icinga-wm>	 ACKNOWLEDGEMENT - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw Cathal Mooney Ran out of disk, rebuilding with a bigger one. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:38:14] <icinga-wm>	 RECOVERY - Host cp2027 is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms
[13:38:18] <icinga-wm>	 PROBLEM - RPKI Validator RTR port on rpki2001 is CRITICAL: connect to address 10.192.0.103 and port 3323: Connection refused https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port
[13:38:34] <icinga-wm>	 PROBLEM - Routinator process on rpki2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process
[13:39:47] <icinga-wm>	 ACKNOWLEDGEMENT - RPKI Validator RTR port on rpki2001 is CRITICAL: connect to address 10.192.0.103 and port 3323: Connection refused Cathal Mooney Ran out of space, will rebuild with more. https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port
[13:40:02] <icinga-wm>	 ACKNOWLEDGEMENT - Routinator process on rpki2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name routinator Cathal Mooney Ran out of space, will rebuild with more. https://wikitech.wikimedia.org/wiki/RPKI%23Process
[13:40:16] <icinga-wm>	 RECOVERY - Disk space on ping3001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ping3001&var-datasource=esams+prometheus/ops
[13:42:10] <wikibugs>	 (03PS7) 10Muehlenhoff: Obsolete role::restbase::base [puppet] - 10https://gerrit.wikimedia.org/r/729943
[13:48:36] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: ceph: auth: introduce datatype for configuration hash [puppet] - 10https://gerrit.wikimedia.org/r/739228 (https://phabricator.wikimedia.org/T293752)
[13:48:38] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: openstack: nova: factorize libvirt secrets management [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752)
[13:48:40] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloud: ceph: libvirt: migrate to new ceph auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/739235 (https://phabricator.wikimedia.org/T293752)
[13:49:10] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "WIP" [puppet] - 10https://gerrit.wikimedia.org/r/739235 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez)
[13:49:56] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/729943 (owner: 10Muehlenhoff)
[13:51:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cloud: ceph: libvirt: migrate to new ceph auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/739235 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez)
[13:51:38] <logmsgbot>	 !log cmooney@cumin2002 START - Cookbook sre.ganeti.makevm for new host rpki2001.codfw.wmnet
[13:51:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:26] <icinga-wm>	 RECOVERY - Check systemd state on ping2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:53:56] <wikibugs>	 10SRE, 10Tracking-Neverending: Cronspam from acmechief-test1001 - https://phabricator.wikimedia.org/T295770 (10Jelto)
[13:54:31] <wikibugs>	 10SRE, 10Tracking-Neverending: Cronspam from acmechief-test1001 - https://phabricator.wikimedia.org/T295770 (10Jelto) p:05Triage→03Low
[13:55:32] <wikibugs>	 (03CR) 10Volans: (WIP) initial cookbook for syncing netbox puppet data (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (owner: 10Jbond)
[13:57:30] <wikibugs>	 (03PS1) 10Cathal Mooney: Update IP address for RPKI Validator session to rpki2001 [homer/public] - 10https://gerrit.wikimedia.org/r/739237 (https://phabricator.wikimedia.org/T292503)
[13:58:57] <logmsgbot>	 !log cmooney@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host rpki2001.codfw.wmnet
[13:58:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:11] <wikibugs>	 (03PS13) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914
[14:01:29] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP/WMF for JKieserman - https://phabricator.wikimedia.org/T295693 (10JKieserman) Hey Daniel,  Yes sorry about that!  I'm a software engineer on the abstract team, reporting to Cai Blanton. Let me know what other information would be useful!  Cheers, Julia
[14:05:22] <wikibugs>	 10SRE, 10SRE-tools, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: WMCS VIPs: Netbox netmask inconsistencies - https://phabricator.wikimedia.org/T295774 (10Volans)
[14:05:25] <wikibugs>	 (03PS14) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914
[14:05:54] <wikibugs>	 (03CR) 10Ayounsi: "The change itself lgtm, but the IP doesn't have a DNS record." [homer/public] - 10https://gerrit.wikimedia.org/r/739237 (https://phabricator.wikimedia.org/T292503) (owner: 10Cathal Mooney)
[14:05:58] <wikibugs>	 (03CR) 10David Caro: cloud: ceph: libvirt: migrate to new ceph auth abstraction (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/739235 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez)
[14:06:33] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] ceph: auth: introduce datatype for configuration hash [puppet] - 10https://gerrit.wikimedia.org/r/739228 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez)
[14:08:34] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] scripts: allow to remove iface from VMs too [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/739178 (owner: 10Volans)
[14:08:53] <wikibugs>	 (03PS5) 10Mbch331: Add missing termbox codes from Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836)
[14:09:19] <logmsgbot>	 !log cmooney@cumin2002 START - Cookbook sre.hosts.decommission for hosts rpki2001.codfw.wmnet
[14:09:19] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (10Jelto)
[14:09:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:22] <wikibugs>	 (03PS8) 10Muehlenhoff: Obsolete role::restbase::base [puppet] - 10https://gerrit.wikimedia.org/r/729943
[14:11:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] scripts: allow to remove iface from VMs too [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/739178 (owner: 10Volans)
[14:12:55] <wikibugs>	 (03CR) 10Herron: [C: 03+1] logstash: reconstruct gitlab sidekiq message field [puppet] - 10https://gerrit.wikimedia.org/r/739018 (https://phabricator.wikimedia.org/T295731) (owner: 10Cwhite)
[14:14:49] <wikibugs>	 (03PS2) 10Jbond: (WIP) initial cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234
[14:15:05] <wikibugs>	 10SRE, 10SRE-tools, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: cloudnet VLAN Netbox discrepancies - https://phabricator.wikimedia.org/T295776 (10Volans)
[14:15:53] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/729943 (owner: 10Muehlenhoff)
[14:17:54] <wikibugs>	 (03CR) 10MSantos: [C: 03+2] tegola-vector-tiles: Disable debugging on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/739118 (owner: 10Jgiannelos)
[14:18:00] <logmsgbot>	 !log cmooney@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts rpki2001.codfw.wmnet
[14:18:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:12] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Rebuild Routinator (rpki) VMs with larger disk - https://phabricator.wikimedia.org/T292503 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cmooney@cumin2002 for hosts: `rpki2001.codfw.wmnet` - rpki2001.codfw.wmnet (**PAS...
[14:18:20] <wikibugs>	 (03CR) 10MSantos: [C: 03+2] tile-pregeneration: Make script less verbose [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/739123 (owner: 10Jgiannelos)
[14:18:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] (WIP) initial cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (owner: 10Jbond)
[14:19:01] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Update approver for gitlab-roots/vrts-roots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738836 (owner: 10Muehlenhoff)
[14:19:10] <wikibugs>	 10SRE, 10SRE-tools, 10Analytics, 10Infrastructure-Foundations, 10netops: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10elukey) The only recent thing that I recall is T276239, but not for all workers mentioned. I checked quickly the dry-run for...
[14:19:24] <wikibugs>	 (03Merged) 10jenkins-bot: tile-pregeneration: Make script less verbose [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/739123 (owner: 10Jgiannelos)
[14:20:09] <wikibugs>	 (03CR) 10MSantos: [C: 03+2] tile-pregeneration: Fix argument order for batching [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/739115 (owner: 10Jgiannelos)
[14:21:40] <wikibugs>	 (03Merged) 10jenkins-bot: tile-pregeneration: Fix argument order for batching [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/739115 (owner: 10Jgiannelos)
[14:22:26] <wikibugs>	 (03Merged) 10jenkins-bot: tegola-vector-tiles: Disable debugging on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/739118 (owner: 10Jgiannelos)
[14:22:36] <logmsgbot>	 !log cmooney@cumin2002 START - Cookbook sre.ganeti.makevm for new host rpki2002.codfw.wmnet
[14:22:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:57] <wikibugs>	 (03PS2) 10Muehlenhoff: Update approver for os-installers [puppet] - 10https://gerrit.wikimedia.org/r/738837
[14:23:31] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Add missing termbox codes from Wikibase (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: 10Mbch331)
[14:23:48] <wikibugs>	 (03CR) 10Jbond: (WIP) initial cookbook for syncing netbox puppet data (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (owner: 10Jbond)
[14:24:03] <wikibugs>	 (03Abandoned) 10Cathal Mooney: Update IP address for RPKI Validator session to rpki2001 [homer/public] - 10https://gerrit.wikimedia.org/r/739237 (https://phabricator.wikimedia.org/T292503) (owner: 10Cathal Mooney)
[14:24:49] <jynus>	 !log re-adding backup user to db1108:analytics_meta T284150
[14:24:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:53] <stashbot>	 T284150: Refactor analytics-meta MariaDB layout to use an-db100[12] - https://phabricator.wikimedia.org/T284150
[14:25:55] <wikibugs>	 (03PS1) 10Elukey: sre.druid.roll-restart-workers: restart Druid exporter [cookbooks] - 10https://gerrit.wikimedia.org/r/739240
[14:26:03] <wikibugs>	 (03PS3) 10Jbond: (WIP) initial cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234
[14:27:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Update approver for os-installers [puppet] - 10https://gerrit.wikimedia.org/r/738837 (owner: 10Muehlenhoff)
[14:27:33] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (10Jelto) >>! In T293253#7504047, @DAbad wrote: > Public Key: AAAAC3NzaC1lZDI1NTE5AAAAIEMCL89wONrqDKRSFKETmGNyQ5OCPlZWjDpYODpBXOMg   Could you check your pasted ssh public key ag...
[14:30:14] <wikibugs>	 (03PS3) 10Muehlenhoff: Update approvers for gitlab-roots/vrts-roots [puppet] - 10https://gerrit.wikimedia.org/r/738836
[14:30:35] <wikibugs>	 (03PS1) 10Jgiannelos: maps: Make silent cURL requests on tile invalidation [puppet] - 10https://gerrit.wikimedia.org/r/739241
[14:30:37] <wikibugs>	 (03CR) 10Muehlenhoff: Update approvers for gitlab-roots/vrts-roots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738836 (owner: 10Muehlenhoff)
[14:31:08] <logmsgbot>	 !log cmooney@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host rpki2002.codfw.wmnet
[14:31:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:13] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Update approvers for gitlab-roots/vrts-roots [puppet] - 10https://gerrit.wikimedia.org/r/738836 (owner: 10Muehlenhoff)
[14:31:23] <wikibugs>	 (03CR) 10LSobanski: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/738836 (owner: 10Muehlenhoff)
[14:31:35] <logmsgbot>	 !log cmooney@cumin2002 START - Cookbook sre.ganeti.makevm for new host rpki2001.codfw.wmnet
[14:31:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:24] <wikibugs>	 (03CR) 10Jgiannelos: "I checked the cronjob logs and the tile invalidation cURL requests are a bit noisy. This patch makes them silent and only show errors." [puppet] - 10https://gerrit.wikimedia.org/r/739241 (owner: 10Jgiannelos)
[14:32:57] <wikibugs>	 (03PS4) 10Muehlenhoff: Update approvers for gitlab-roots/vrts-roots [puppet] - 10https://gerrit.wikimedia.org/r/738836
[14:33:41] <wikibugs>	 (03CR) 10Jgiannelos: "recheck" [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/739115 (owner: 10Jgiannelos)
[14:34:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Update approvers for gitlab-roots/vrts-roots [puppet] - 10https://gerrit.wikimedia.org/r/738836 (owner: 10Muehlenhoff)
[14:39:36] <wikibugs>	 10SRE, 10Tracking-Neverending: Cronspam from acmechief-test1001 - https://phabricator.wikimedia.org/T295770 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez LE staging environment had a rough time :) It's fixed now
[14:39:38] <wikibugs>	 10SRE, 10Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10Vgutierrez)
[14:44:22] <logmsgbot>	 !log cmooney@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host rpki2001.codfw.wmnet
[14:44:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:36] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Run helmfile commands against the local version of the chart (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 (owner: 10JMeybohm)
[14:47:05] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:50:19] <wikibugs>	 (03PS1) 10Cathal Mooney: Updating MAC address in DHCP config for rpki2001 [puppet] - 10https://gerrit.wikimedia.org/r/739242 (https://phabricator.wikimedia.org/T292503)
[14:51:06] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Updating MAC address in DHCP config for rpki2001 [puppet] - 10https://gerrit.wikimedia.org/r/739242 (https://phabricator.wikimedia.org/T292503) (owner: 10Cathal Mooney)
[14:52:10] <wikibugs>	 (03PS1) 10Jbond: P:pki::client: manually deploy the root CA in cloud [puppet] - 10https://gerrit.wikimedia.org/r/739266 (https://phabricator.wikimedia.org/T291905)
[14:52:18] <wikibugs>	 (03CR) 10Volans: [C: 03+2] scripts: allow to remove iface from VMs too [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/739178 (owner: 10Volans)
[14:52:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] P:pki::client: manually deploy the root CA in cloud [puppet] - 10https://gerrit.wikimedia.org/r/739266 (https://phabricator.wikimedia.org/T291905) (owner: 10Jbond)
[14:53:14] <wikibugs>	 (03Merged) 10jenkins-bot: scripts: allow to remove iface from VMs too [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/739178 (owner: 10Volans)
[14:56:01] <wikibugs>	 (03PS2) 10Jbond: P:pki::client: manually deploy the root CA in cloud [puppet] - 10https://gerrit.wikimedia.org/r/739266 (https://phabricator.wikimedia.org/T291905)
[14:56:54] <wikibugs>	 (03PS1) 10Jgiannelos: tegola-vector-tiles: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/739267
[14:59:07] <wikibugs>	 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10Joe) 05Open→03Resolved
[14:59:30] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32438/console" [puppet] - 10https://gerrit.wikimedia.org/r/739266 (https://phabricator.wikimedia.org/T291905) (owner: 10Jbond)
[14:59:57] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::client: manually deploy the root CA in cloud [puppet] - 10https://gerrit.wikimedia.org/r/739266 (https://phabricator.wikimedia.org/T291905) (owner: 10Jbond)
[15:04:39] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (10ayounsi) For the sake of completeness, another option could be to add the fffff:<v4> IP to the loopback address, but that would be more of a workaround than a long term solution....
[15:06:12] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/739267 (owner: 10Jgiannelos)
[15:09:05] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:10:07] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:10:29] <wikibugs>	 (03Merged) 10jenkins-bot: tegola-vector-tiles: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/739267 (owner: 10Jgiannelos)
[15:12:25] <wikibugs>	 (03PS6) 10Mbch331: Add missing termbox codes from Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836)
[15:13:46] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Netbox - PuppetDB audit 2021-11 - https://phabricator.wikimedia.org/T295762 (10Volans)
[15:13:54] <wikibugs>	 10SRE, 10SRE-tools, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: cloudnet VLAN Netbox discrepancies - https://phabricator.wikimedia.org/T295776 (10Volans) 05Open→03Resolved a:03Volans After verifying that the changes were all expected and the VLAN bits were actually an artifact of how...
[15:14:23] <wikibugs>	 (03CR) 10Mbch331: Add missing termbox codes from Wikibase (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: 10Mbch331)
[15:15:55] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (10Jgreen) @ayounsi I think it would be fine to do the codfw pfw's this year. Please ping me on IRC when you have some time to discuss.
[15:21:09] <wikibugs>	 (03PS4) 10Vgutierrez: cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005)
[15:21:57] <wikibugs>	 10SRE, 10SRE-tools, 10Analytics, 10Infrastructure-Foundations, 10netops: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10BTullis) Yes I thought this was a bit odd. I saw there was a bit of re-imaging here: T231067#6891049 but that was before my t...
[15:22:14] <wikibugs>	 10SRE, 10SRE-tools, 10Analytics, 10Data-Engineering, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10BTullis) a:03BTullis
[15:22:41] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez)
[15:24:15] <wikibugs>	 (03PS1) 10Elukey: Revert "Update deployment-prep's profile::base::certificates settings" [puppet] - 10https://gerrit.wikimedia.org/r/739260
[15:24:32] <wikibugs>	 (03Abandoned) 10Elukey: Revert "Update deployment-prep's profile::base::certificates settings" [puppet] - 10https://gerrit.wikimedia.org/r/739260 (owner: 10Elukey)
[15:26:32] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Access request to superset for user KSiebert - https://phabricator.wikimedia.org/T295777 (10RhinosF1) The linked patch does not seem related. Has this been copied from somewhere?
[15:27:10] <wikibugs>	 (03CR) 10MSantos: [C: 03+1] maps: Make silent cURL requests on tile invalidation [puppet] - 10https://gerrit.wikimedia.org/r/739241 (owner: 10Jgiannelos)
[15:28:36] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] ceph: auth: introduce datatype for configuration hash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739228 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez)
[15:30:15] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC noop https://puppet-compiler.wmflabs.org/compiler1002/32440/" [puppet] - 10https://gerrit.wikimedia.org/r/739228 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez)
[15:32:49] <icinga-wm>	 PROBLEM - Host prometheus5001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:33:17] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (10Jelto) Hello Carol  We received an access request from Desiree Abad to the group analytics-privatedata-users. Desiree wants to work on Analytics & Metrics Platform to service...
[15:33:41] <wikibugs>	 (03CR) 10Eigyan: Set up beta test environment for QuickSurveys (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR)
[15:34:08] <wikibugs>	 (03PS5) 10JMeybohm: Run helmfile commands against the local version of the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857
[15:34:10] <wikibugs>	 (03PS4) 10JMeybohm: Auto add helm chart repositories [deployment-charts] - 10https://gerrit.wikimedia.org/r/739122
[15:34:33] <icinga-wm>	 PROBLEM - Host doh5001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:34:35] <icinga-wm>	 PROBLEM - Host durum5001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:34:55] <icinga-wm>	 PROBLEM - Host ganeti5002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:34:57] <sukhe>	 oh oh
[15:35:41] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:36:16] <wikibugs>	 (03PS5) 10Vgutierrez: cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005)
[15:36:21] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:36:47] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Access request to superset for user KSiebert - https://phabricator.wikimedia.org/T295777 (10KSiebert) Yes @RhinosF1 I copied my teammates request because I realized that my permissions in superset are kind of limited and I don't understand how.
[15:37:02] <sukhe>	 ^ does anyone know what's up in eqsin? 
[15:37:09] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:37:20] <topranks>	 No, just seen it. Netbox lists ganeti5002 as "failed", I see.
[15:37:22] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Run helmfile commands against the local version of the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 (owner: 10JMeybohm)
[15:37:44] <sukhe>	 (yeah, that would explain the doh and durum hosts)
[15:37:46] <topranks>	 My assumption was that server failed and took the VMs with it
[15:37:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez)
[15:37:58] <topranks>	 But not sure, Netbox wouldn't be magically aware of a random failure.
[15:38:19] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:38:20] <sukhe>	 oh yeah so then the BGP/BFD alerts might be related to the doh and durum hosts being down
[15:38:39] <topranks>	 I expect so
[15:38:50] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Access request to superset for user KSiebert - https://phabricator.wikimedia.org/T295777 (10RhinosF1) No problem, whoever is on clinic duty this week will assist and add you. That should be @Jelto.
[15:38:51] <topranks>	 That host is up on its iDRAC / dedicated management interface anyway
[15:38:56] <topranks>	 cmooney@cumin2002:~$ ping ganeti5002.mgmt.eqsin.wmnet
[15:38:56] <topranks>	 PING ganeti5002.mgmt.eqsin.wmnet (10.132.129.114) 56(84) bytes of data.
[15:38:56] <topranks>	 64 bytes from ganeti5002.mgmt.eqsin.wmnet (10.132.129.114): icmp_seq=1 ttl=61 time=218 ms
[15:38:56] <topranks>	 64 bytes from wmf7194.mgmt.eqsin.wmnet (10.132.129.114): icmp_seq=2 ttl=61 time=218 ms
[15:39:31] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Access request to superset for user KSiebert - https://phabricator.wikimedia.org/T295777 (10RhinosF1)
[15:40:14] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Access request to superset for user KSiebert - https://phabricator.wikimedia.org/T295777 (10RhinosF1)
[15:40:35] <topranks>	 Certainly looks down, from ganeti5001 it has no stats for it, but ganeti5002 does appear to be in the cluster
[15:40:41] <topranks>	 https://www.irccloud.com/pastebin/1GspDp5R/
[15:41:08] <sukhe>	 ah
[15:41:12] <sukhe>	 ineed
[15:41:16] <sukhe>	 indeed even :)
[15:43:42] <topranks>	 ineed my VMs to work says sukhe :D
[15:43:43] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Add missing termbox codes from Wikibase (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: 10Mbch331)
[15:44:04] <sukhe>	 topranks: :D 
[15:44:08] <topranks>	 I can't get onto the iDRAC interface of that box from bast5001, but unsure if that should be possible (may be blocked by fw).
[15:44:50] <wikibugs>	 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Christoph Jauera - https://phabricator.wikimedia.org/T295781 (10WMDE-Fisch)
[15:44:56] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] ci: upgrade lintian-junit-report [puppet] - 10https://gerrit.wikimedia.org/r/739141 (https://phabricator.wikimedia.org/T295719) (owner: 10Hashar)
[15:45:30] <topranks>	 sukhe: so it looks to me like the host has failed.
[15:45:41] <wikibugs>	 (03PS6) 10Vgutierrez: cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005)
[15:46:02] <topranks>	 And I expect we probably take 2 parallel actions - re-deploy any missing VMs to the other Ganeti hosts, and work on getting the server back running / replaced.
[15:46:28] <topranks>	 moritzm, XioNoX: Does that sound right (sry don't know who to pick on here)
[15:46:30] <wikibugs>	 (03PS5) 10JMeybohm: Auto add helm chart repositories [deployment-charts] - 10https://gerrit.wikimedia.org/r/739122
[15:46:32] <wikibugs>	 (03PS6) 10JMeybohm: Run helmfile commands against the local version of the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857
[15:47:11] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Christoph Jauera - https://phabricator.wikimedia.org/T295781 (10Tobi_WMDE_SW) Approving that @WMDE-Fisch is in my team and needs the access for the stated reasons. Would be awesome if it could be granted.
[15:47:14] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez)
[15:47:41] <topranks>	 Looks like the only VMs that were on it are the ones that have alerted
[15:47:49] <topranks>	 https://www.irccloud.com/pastebin/YXctu6pD/
[15:48:02] <wikibugs>	 (03CR) 10JMeybohm: "I did also change order of commits so that CI should no longer fail...maybe" [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 (owner: 10JMeybohm)
[15:48:07] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Run helmfile commands against the local version of the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 (owner: 10JMeybohm)
[15:48:21] <jayme>	 well...that went nicely :)
[15:48:28] <wikibugs>	 (03PS7) 10Vgutierrez: cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005)
[15:49:37] <wikibugs>	 (03PS7) 10JMeybohm: Run helmfile commands against the local version of the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857
[15:49:47] <XioNoX>	 topranks: I'd say ask the service owners, o11y for prometheus, sukhe for the other 2, and file a high priority task for DCops
[15:49:59] <topranks>	 ok thanks for the advice 
[15:50:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez)
[15:50:17] <vgutierrez>	 not my day...
[15:50:21] <XioNoX>	 but I'd guess it's going to be a yes on rebuilding the instances :)
[15:50:48] <XioNoX>	 vgutierrez: it's more your day than ganeti5002's day
[15:50:50] <XioNoX>	 :)
[15:50:59] <topranks>	 sukhe: see the advice above, sounds like rebuilding the VMs is the way to go here.
[15:51:43] <XioNoX>	 robh: ^ warning, incoming high priority task about a failed ganeti server in eqsin
[15:52:24] <wikibugs>	 (03PS1) 10Jbond: cookbook sre.idm.u2f: add cookbook to enable/disable u2f [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579)
[15:52:51] <sukhe>	 hello, back
[15:53:02] <andrewbogott>	 !log merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/525220 which makes read-only ldap the default for ldap clients
[15:53:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch eqiad labsldapconfig to the read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/525220 (https://phabricator.wikimedia.org/T46722) (owner: 10Muehlenhoff)
[15:53:26] <XioNoX>	 lmata: fyi, prometheus5001 is dead (See above)
[15:53:38] <sukhe>	 so given that the doh and durum are anycasted, I am not worried about failing requests or anything (thankfully!) but yeah
[15:53:48] <sukhe>	 XioNoX: so out of curiosity, what happened here?
[15:54:00] <lmata>	 thanks XioNoX will look into it
[15:54:14] <sukhe>	 (thanks topranks and XioNoX btw)
[15:54:25] <topranks>	 sukhe: I'm not sure, the iDRAC management of that server responds to pings, the main IP that the debian OS is using is not responding.
[15:54:35] <topranks>	 Could be anything from a hardware failure to a kernel panic or something.
[15:55:01] <topranks>	 I couldn't reach the iDRAC virtual console, but I only tried from bast5001, not sure if the connection is allowed from there.
[15:55:14] <topranks>	 DC Ops can investigate and advise, but the host is dead right now.
[15:55:19] <XioNoX>	 topranks: idrac over https can only be reached from the cumin hosts, ssh from bast as well (iirc)
[15:55:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cookbook sre.idm.u2f: add cookbook to enable/disable u2f [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond)
[15:57:07] <XioNoX>	 topranks: let me know if you need any help
[15:58:43] <moritzm>	 I'm moving instances off ganeti5002 now
[15:59:36] <XioNoX>	 moritzm: out of curiosity, does that mean re-building the instances, or the storage is duplicated on the other nodes as well?
[16:00:16] <moritzm>	 we'd move to the secondary instances, was there any luck with ganeti5002's mgmt?
[16:00:24] <topranks>	 It's up to ping.
[16:00:36] <moritzm>	 do we get anything in the SEL?
[16:01:34] <XioNoX>	 moritzm: robh is looking into it in -dcops
[16:01:49] <moritzm>	 last event is from Oct about a power blip
[16:02:26] <moritzm>	 it's oopsing
[16:02:38] <volans>	 I get a root login prompt
[16:02:43] <moritzm>	 somewhere in KVM
[16:02:49] <moritzm>	 I'm going to powercycle
[16:03:07] <volans>	 +1
[16:03:09] <volans>	 it's ooming
[16:03:21] <XioNoX>	 moritzm: please sync up with robh as well
[16:03:30] <XioNoX>	 I think he is about to reboot it as well
[16:03:31] <topranks>	 Ok
[16:03:39] <topranks>	 it's at a login prompt on the virtual console
[16:04:12] <topranks>	 https://usercontent.irccloud-cdn.com/file/mVlyFjDt/image.png
[16:04:28] <volans>	 topranks: try to login and you'll see the oom
[16:04:31] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops: Failed host: ganetti5002 - https://phabricator.wikimedia.org/T295783 (10RhinosF1)
[16:04:49] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops: Failed host: ganeti5002 - https://phabricator.wikimedia.org/T295783 (10RhinosF1)
[16:05:02] <volans>	  16:04:54 up 285 days,  1:37,  1 user,  load average: 65.44, 63.86, 56.40
[16:05:08] <moritzm>	 !log powercycling ganeti5002
[16:05:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:28] <wikibugs>	 (03CR) 10Eigyan: [C: 03+1] Set up beta test environment for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR)
[16:06:30] <topranks>	 just occurred to me - oom = "out of memory"
[16:06:49] <volans>	 console was full of stack traces
[16:06:52] <topranks>	 ok thanks volans, reboot will reset that anyway.
[16:07:01] <volans>	 rebooting bios now
[16:07:12] <wikibugs>	 10SRE, 10SRE-tools, 10Analytics, 10Data-Engineering, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10BTullis) ==an-worker1104== ====Current interfaces snapshot: {F34750374,width=600} ====Current interfaces: * eno1 - SFTP+ - connected...
[16:07:12] <topranks>	 ganeti memory leak bug or something?
[16:07:27] <topranks>	 robh: I'm off the console if you need it
[16:09:44] <wikibugs>	 (03PS9) 10Jhernandez: Set up beta test environment for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR)
[16:10:01] <moritzm>	 topranks: I think some kernel bug
[16:10:08] <topranks>	 ok
[16:10:31] <icinga-wm>	 RECOVERY - Host ganeti5002 is UP: PING OK - Packet loss = 0%, RTA = 248.27 ms
[16:10:48] <topranks>	 \o/
[16:10:49] <sukhe>	 hm!
[16:10:54] <wikibugs>	 (03PS8) 10Vgutierrez: cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005)
[16:11:27] <moritzm>	 due to the powercycle we now have a more recent kernel as well (4.19.208 over 4.19.171), but if we're lucky whatever hit it is already backported into 4.19.208 :-)
[16:12:51] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
[16:12:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:29] <icinga-wm>	 RECOVERY - Host prometheus5001 is UP: PING OK - Packet loss = 0%, RTA = 247.37 ms
[16:15:17] <icinga-wm>	 RECOVERY - Host doh5001 is UP: PING OK - Packet loss = 0%, RTA = 247.83 ms
[16:15:35] <icinga-wm>	 RECOVERY - Host durum5001 is UP: PING OK - Packet loss = 0%, RTA = 248.56 ms
[16:15:59] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] Run helmfile commands against the local version of the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 (owner: 10JMeybohm)
[16:16:21] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32441/console" [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez)
[16:17:11] <icinga-wm>	 PROBLEM - Check systemd state on prometheus5001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:17:57] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:18:31] <icinga-wm>	 RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 73, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:19:07] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:19:09] <icinga-wm>	 RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 327, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:19:17] <icinga-wm>	 PROBLEM - Check systemd state on durum5001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:20:58] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops: Failed host: ganeti5002 - https://phabricator.wikimedia.org/T295783 (10MoritzMuehlenhoff) 05Open→03Resolved I powercycled the server over the mgmt and it came back up fine. Closing since there's no fixable hardware issue. As part of the reboot (coincidentally I had rolled ou...
[16:21:27] <icinga-wm>	 RECOVERY - Check systemd state on prometheus5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:21:50] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "Thanks for this!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/739130 (https://phabricator.wikimedia.org/T295717) (owner: 10Hnowlan)
[16:22:27] <moritzm>	 !log systemctl reset-failed ifup@ens13 on durum5001 after restart T273026
[16:22:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:31] <stashbot>	 T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026
[16:23:33] <icinga-wm>	 RECOVERY - Check systemd state on durum5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:23:34] <herron>	 !log systemctl reset-failed ifup@ens13 on prometheus5001 T273026
[16:23:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:56] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: reconstruct gitlab sidekiq message field [puppet] - 10https://gerrit.wikimedia.org/r/739018 (https://phabricator.wikimedia.org/T295731) (owner: 10Cwhite)
[16:23:58] <wikibugs>	 (03CR) 10Volans: cookbook sre.idm.u2f: add cookbook to enable/disable u2f (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond)
[16:27:28] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
[16:27:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:29] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 11): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32442/console" [puppet] - 10https://gerrit.wikimedia.org/r/738919 (owner: 10Giuseppe Lavagetto)
[16:35:29] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] "The change is a noop everywhere, so at the very least it's not harmful to merge." [puppet] - 10https://gerrit.wikimedia.org/r/738919 (owner: 10Giuseppe Lavagetto)
[16:37:15] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/725331 (https://phabricator.wikimedia.org/T292291) (owner: 10BBlack)
[16:42:20] <wikibugs>	 (03PS2) 10Dzahn: remove scholarships.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/738028 (https://phabricator.wikimedia.org/T243037)
[16:43:11] <wikibugs>	 (03PS1) 10Jbond: apereo_cas: add cas_u2f script [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579)
[16:43:44] <wikibugs>	 (03CR) 10Dave Pifke: [C: 03+1] "Looks good now, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/737970 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey)
[16:47:09] <wikibugs>	 (03PS1) 10Ahmon Dancy: Added docker::resource_monitor class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707)
[16:47:41] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Added docker::resource_monitor class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy)
[16:48:50] <wikibugs>	 (03PS2) 10Ahmon Dancy: Added docker::resource_monitor class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707)
[16:50:21] <wikibugs>	 (03CR) 10Volans: "LGTM, couple of minor nits inline" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 (owner: 10Jbond)
[16:52:52] <wikibugs>	 (03PS3) 10Ahmon Dancy: Added docker::resource_monitor class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707)
[16:56:42] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
[16:56:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:41] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:57:51] <icinga-wm>	 PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 100 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[16:58:56] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Although I'm not familiar with the underlying DB, LGTM, nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond)
[16:59:51] <icinga-wm>	 RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[17:00:04] <jouncebot>	 jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211116T1700).
[17:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:30] <rzl>	 puppet window complete ✅
[17:01:41] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:02:07] <wikibugs>	 10SRE, 10Scap, 10Release-Engineering-Team (Seen): find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470 (10dancy) >>! In T197470#7506161, @hnowlan wrote: > For the immediate term if there are no objections I will replace all instances of `...
[17:02:10] <wikibugs>	 (03CR) 10Majavah: P::kerberos: automate principal management (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736753 (https://phabricator.wikimedia.org/T292389) (owner: 10Majavah)
[17:02:40] <wikibugs>	 10SRE, 10Scap, 10Release-Engineering-Team (Priority Backlog 📥): find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470 (10dancy)
[17:08:48] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10Papaul) @MoritzMuehlenhoff @RobH the 2 ganeti nodes are we racking them in a 10G rack or 1G?  "Networking/Subnet/VLAN/IP: 10G, same VLAN/IP setup as existing Ganeti serv...
[17:11:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[17:11:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:44] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:15:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:40] <wikibugs>	 (03PS1) 10Majavah: acme_chief: add -rw to ldap certs [puppet] - 10https://gerrit.wikimedia.org/r/739283 (https://phabricator.wikimedia.org/T295150)
[17:18:22] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "removed from ATS yesterday, nothing in apache access logs" [dns] - 10https://gerrit.wikimedia.org/r/738028 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn)
[17:18:32] <wikibugs>	 (03CR) 10Muehlenhoff: apereo_cas: add cas_u2f script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond)
[17:19:37] <wikibugs>	 (03PS1) 10Majavah: wikimedia.org: add ldap-rw to replace ldap-labs [dns] - 10https://gerrit.wikimedia.org/r/739284 (https://phabricator.wikimedia.org/T295150)
[17:19:48] <wikibugs>	 (03PS2) 10Majavah: wikimedia.org: add ldap-rw to replace ldap-labs [dns] - 10https://gerrit.wikimedia.org/r/739284 (https://phabricator.wikimedia.org/T295150)
[17:20:02] <mutante>	 !log removing scholarships.wikimedia.org from DNS - T243037 
[17:20:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:05] <stashbot>	 T243037: Shutdown scholarships.wikimedia.org and archive project - https://phabricator.wikimedia.org/T243037
[17:20:11] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10MoritzMuehlenhoff) >>! In T294139#7507224, @Papaul wrote: > @MoritzMuehlenhoff @RobH the 2 ganeti nodes are we racking them in a 10G rack or 1G?  If there's sufficient s...
[17:22:04] <wikibugs>	 (03CR) 10Jeena Huneidi: "recheck" [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739043 (owner: 10TrainBranchBot)
[17:22:12] <wikibugs>	 (03PS1) 10AOkoth: gitlab: re-enable restore timer [puppet] - 10https://gerrit.wikimedia.org/r/739306 (https://phabricator.wikimedia.org/T294580)
[17:22:27] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "lgtm but because we already merged one ldap hostname patch today I'm going to wait a bit before merging this one." [dns] - 10https://gerrit.wikimedia.org/r/739284 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah)
[17:23:10] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+2] Branch commit for wmf/1.38.0-wmf.9 [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739043 (owner: 10TrainBranchBot)
[17:23:35] <wikibugs>	 (03PS5) 10Arturo Borrero Gonzalez: openstack: nova: factorize libvirt secrets management [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752)
[17:24:48] <wikibugs>	 (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/compiler1002/32443/" [puppet] - 10https://gerrit.wikimedia.org/r/739306 (https://phabricator.wikimedia.org/T294580) (owner: 10AOkoth)
[17:25:15] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] gitlab: re-enable restore timer [puppet] - 10https://gerrit.wikimedia.org/r/739306 (https://phabricator.wikimedia.org/T294580) (owner: 10AOkoth)
[17:25:49] <wikibugs>	 (03PS2) 10Jbond: apereo_cas: add cas_u2f script [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579)
[17:25:57] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Wikibase Release Strategy, 10Wikidata, 10wdwb-tech: Requesting access to releasers-wikibase for rosalie-WMDE - https://phabricator.wikimedia.org/T295765 (10thcipriani) >>! In T295765#7506451, @WMDE-leszek wrote: > Looking at the previous requests of this kind (T269777, T28...
[17:26:02] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Security-Team: Add Manfredi Martorana to wmf ldap group - https://phabricator.wikimedia.org/T295789 (10sbassett)
[17:26:20] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Manfredi Martorana to wmf ldap group - https://phabricator.wikimedia.org/T295789 (10sbassett)
[17:26:54] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: add cas_u2f script [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond)
[17:29:55] <wikibugs>	 (03PS1) 10Btullis: Remove override for spark.local.dir on stat100x servers [puppet] - 10https://gerrit.wikimedia.org/r/739307 (https://phabricator.wikimedia.org/T295346)
[17:30:14] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Manfredi Martorana to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T295790 (10sbassett)
[17:31:07] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32444/console" [puppet] - 10https://gerrit.wikimedia.org/r/739307 (https://phabricator.wikimedia.org/T295346) (owner: 10Btullis)
[17:31:24] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Manfredi Martorana to wmf ldap group - https://phabricator.wikimedia.org/T295789 (10sbassett)
[17:31:27] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Remove override for spark.local.dir on stat100x servers [puppet] - 10https://gerrit.wikimedia.org/r/739307 (https://phabricator.wikimedia.org/T295346) (owner: 10Btullis)
[17:36:35] <wikibugs>	 (03PS3) 10Jbond: apereo_cas: add cas_u2f script [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579)
[17:38:20] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10Papaul) I downgraded Junos on QFX5100 at https://netbox.wikimedia.org/dcim/rack-elevations/ and did a request system zeroize on it. This is the one we will be using to repl...
[17:38:44] <wikibugs>	 (03CR) 10AOkoth: [C: 03+2] gitlab: re-enable restore timer [puppet] - 10https://gerrit.wikimedia.org/r/739306 (https://phabricator.wikimedia.org/T294580) (owner: 10AOkoth)
[17:41:08] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] api-gateway: disable debug logging outside of staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/739130 (https://phabricator.wikimedia.org/T295717) (owner: 10Hnowlan)
[17:42:17] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/1.38.0-wmf.9 [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739043 (owner: 10TrainBranchBot)
[17:44:47] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10serviceops: Support services VIPs with not marked as VIP in Netbox - https://phabricator.wikimedia.org/T295793 (10Volans)
[17:45:39] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: disable debug logging outside of staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/739130 (https://phabricator.wikimedia.org/T295717) (owner: 10Hnowlan)
[17:46:24] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10ayounsi) That works for me, thanks, can you send a calendar invite? Note that the link in your comment doesn't point to any specific device.
[17:47:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[17:47:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:14] <wikibugs>	 (03PS1) 10Hnowlan: api-gateway: Bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/739308 (https://phabricator.wikimedia.org/T295717)
[17:49:56] <wikibugs>	 (03PS2) 10Jbond: cookbook sre.idm.u2f: add cookbook to enable/disable u2f [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579)
[17:50:02] <wikibugs>	 (03CR) 10Jbond: "updated thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond)
[17:51:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[17:51:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:52:02] <wikibugs>	 (03PS3) 10Jbond: cookbook sre.idm.u2f: add cookbook to enable/disable u2f [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579)
[17:53:28] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] api-gateway: Bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/739308 (https://phabricator.wikimedia.org/T295717) (owner: 10Hnowlan)
[17:54:22] <wikibugs>	 (03PS4) 10Jbond: apereo_cas: add cas_u2f script [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579)
[17:54:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cookbook sre.idm.u2f: add cookbook to enable/disable u2f [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond)
[17:58:03] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: Bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/739308 (https://phabricator.wikimedia.org/T295717) (owner: 10Hnowlan)
[17:58:14] <wikibugs>	 (03PS1) 10Volans: netbox - cas: allow users with active=False [software/netbox] - 10https://gerrit.wikimedia.org/r/739309
[17:58:22] <wikibugs>	 (03CR) 10Jbond: apereo_cas: add cas_u2f script (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond)
[17:58:50] <wikibugs>	 (03PS2) 10Volans: netbox - cas: allow users with active=False [software/netbox] - 10https://gerrit.wikimedia.org/r/739309 (https://phabricator.wikimedia.org/T295148)
[17:59:03] <wikibugs>	 (03PS4) 10Ahmon Dancy: Added docker::resource_monitor and docker::gc classes [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707)
[17:59:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Added docker::resource_monitor and docker::gc classes [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy)
[18:00:04] <jouncebot>	 chrisalbon and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211116T1800).
[18:00:30] <wikibugs>	 10SRE, 10SRE-tools, 10Analytics, 10Data-Engineering, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10BTullis) So at first glance, this looks like the Netbox script will do the right thing. It will delete and recreate the cable, bu...
[18:00:41] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[18:00:45] <wikibugs>	 (03PS5) 10Jbond: apereo_cas: add cas_u2f script [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579)
[18:01:46] <wikibugs>	 (03PS5) 10Ahmon Dancy: Added docker::resource_monitor and docker::gc classes [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707)
[18:02:04] <wikibugs>	 (03PS4) 10Jbond: cookbook sre.idm.u2f: add cookbook to enable/disable u2f [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579)
[18:02:18] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Added docker::resource_monitor and docker::gc classes [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy)
[18:02:37] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] netbox - cas: allow users with active=False [software/netbox] - 10https://gerrit.wikimedia.org/r/739309 (https://phabricator.wikimedia.org/T295148) (owner: 10Volans)
[18:03:15] <wikibugs>	 (03PS6) 10Ahmon Dancy: Added docker::resource_monitor and docker::gc classes [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707)
[18:03:39] <wikibugs>	 (03PS1) 10MSantos: mobileapps: bumpt to 2021-11-16-154934-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/739311
[18:03:49] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Added docker::resource_monitor and docker::gc classes [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy)
[18:04:29] <wikibugs>	 (03PS7) 10Ahmon Dancy: Added docker::resource_monitor and docker::gc classes [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707)
[18:04:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cookbook sre.idm.u2f: add cookbook to enable/disable u2f [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond)
[18:04:57] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[18:05:01] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Added docker::resource_monitor and docker::gc classes [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy)
[18:05:21] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: cloud: ceph: libvirt: migrate to new ceph auth abstraction (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739235 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez)
[18:05:26] <wikibugs>	 (03PS2) 10MSantos: mobileapps: bump to 2021-11-16-154934-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/739311
[18:06:59] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGMT, nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond)
[18:08:10] <wikibugs>	 10SRE, 10SRE-tools, 10Analytics, 10Data-Engineering, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10Volans) @BTullis fwiw +1 from my end, thanks for having a look.
[18:09:00] <wikibugs>	 (03PS8) 10Ahmon Dancy: Added docker::resource_monitor and docker::gc classes [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707)
[18:11:40] <wikibugs>	 (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2021-11-16-154934-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/739311 (owner: 10MSantos)
[18:13:10] <wikibugs>	 (03PS9) 10Ahmon Dancy: Added docker::resource_monitor and docker::gc classes [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707)
[18:16:38] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: bump to 2021-11-16-154934-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/739311 (owner: 10MSantos)
[18:17:46] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[18:17:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:29] <wikibugs>	 (03PS6) 10Jbond: apereo_cas: add cas_u2f script [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579)
[18:19:34] <wikibugs>	 (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond)
[18:21:09] <mbsantos>	 I'm deploying mobileapps and started with the staging environment. After the deployment there are no pods running, any thoughts why this is happening? cc/ akosiaris and _joe_ 
[18:22:37] <_joe_>	 mbsantos: how did you determine there are no pods running?
[18:22:57] <mbsantos>	 helmfile -e staging status would give me the list of pods running
[18:23:12] <mbsantos>	 I'm assuming this would still be the case
[18:23:15] <_joe_>	 mbsantos: oh yes, that's the transition to helm 3
[18:23:51] <_joe_>	 helm 3 is probably now enabled everywhere for staging, and indeed helm status in helm3 doesn't give you the same amount of info
[18:23:53] <mbsantos>	 ah that makes sense, I was afraid to continue with deployment because of that
[18:23:54] <_joe_>	 so if you do
[18:23:57] <mutante>	 I did "kube_env mobileapps staging" and "kubectl get pods" on deploy1002 and can see pods, fwiw
[18:24:05] <_joe_>	 that ^^
[18:24:16] <_joe_>	 thanks mutante I was typing exactly that :)
[18:24:21] <mbsantos>	 thanks mutante and _joe_
[18:24:25] <mutante>	 glad it was right
[18:24:30] <_joe_>	 but yes, this is a change that should be highlighted to ops@
[18:24:41] <_joe_>	 cc jelto ^^
[18:26:21] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[18:26:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:37] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[18:28:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:33] <wikibugs>	 (03PS5) 10Dzahn: add miscweb to LVS [puppet] - 10https://gerrit.wikimedia.org/r/694625 (https://phabricator.wikimedia.org/T281538)
[18:31:05] <wikibugs>	 (03CR) 10Dzahn: add miscweb to LVS (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/694625 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn)
[18:31:40] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti6004.drmrs.wmnet with OS bullseye
[18:31:40] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti6002.drmrs.wmnet with OS bullseye
[18:31:40] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti6001.drmrs.wmnet with OS bullseye
[18:31:40] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti6003.drmrs.wmnet with OS bullseye
[18:31:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:31:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:31:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:31:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:31:49] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host ganeti6002.drmrs.wmnet with OS bullseye
[18:31:52] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host ganeti6001.drmrs.wmnet with OS bullseye
[18:31:54] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host ganeti6004.drmrs.wmnet with OS bullseye
[18:32:00] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host ganeti6003.drmrs.wmnet with OS bullseye
[18:32:53] <wikibugs>	 (03PS15) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914
[18:33:21] <wikibugs>	 (03CR) 10Jbond: hiera: create script endpoint for exporting hiera data (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 (owner: 10Jbond)
[18:33:56] <wikibugs>	 (03PS1) 10Jeena Huneidi: testwikis wikis to 1.38.0-wmf.9  refs T293950 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739316
[18:34:00] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+2] testwikis wikis to 1.38.0-wmf.9  refs T293950 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739316 (owner: 10Jeena Huneidi)
[18:34:56] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.38.0-wmf.9  refs T293950 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739316 (owner: 10Jeena Huneidi)
[18:34:59] <logmsgbot>	 !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.38.0-wmf.9  refs T293950
[18:35:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:02] <stashbot>	 T293950: 1.38.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T293950
[18:37:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[18:37:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[18:41:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:59] <cmjohnson1>	 !log moving mgmt cables from old msw to new msw in a2-eqiad
[18:42:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:18] <wikibugs>	 (03PS1) 10Herron: role::elasticsearch::cloudelastic: ship ES logs via gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/739324 (https://phabricator.wikimedia.org/T288620)
[18:48:05] <wikibugs>	 (03PS1) 10Herron: role::elasticsearch::relforge: ship ES logs via gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/739325 (https://phabricator.wikimedia.org/T288620)
[18:48:31] <wikibugs>	 (03PS2) 10Herron: role::elasticsearch::relforge: ship ES logs via gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/739325 (https://phabricator.wikimedia.org/T288620)
[18:49:03] <wikibugs>	 (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/739324 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron)
[18:50:01] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[18:50:26] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] WMCS haproxy: set expose-fd listeners for all services [puppet] - 10https://gerrit.wikimedia.org/r/737986 (owner: 10Andrew Bogott)
[18:51:21] <logmsgbot>	 !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti6002.drmrs.wmnet with OS bullseye
[18:51:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:51:30] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host ganeti6002.drmrs.wmnet with OS bullseye executed with errors: - gan...
[18:53:16] <wikibugs>	 (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/739325 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron)
[18:54:48] <wikibugs>	 (03PS1) 10Andrew Bogott: mark_tool: Remove reliance of /etc/ldap.conf or /etc/ldap/ldap.conf [puppet] - 10https://gerrit.wikimedia.org/r/739326 (https://phabricator.wikimedia.org/T170355)
[18:55:22] <cmjohnson1>	 !log moving mgmt cables from old msw to new msw in a3-eqiad
[18:55:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mark_tool: Remove reliance of /etc/ldap.conf or /etc/ldap/ldap.conf [puppet] - 10https://gerrit.wikimedia.org/r/739326 (https://phabricator.wikimedia.org/T170355) (owner: 10Andrew Bogott)
[18:56:25] <logmsgbot>	 !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti6003.drmrs.wmnet with OS bullseye
[18:56:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:33] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host ganeti6003.drmrs.wmnet with OS bullseye executed with errors: - gan...
[18:56:35] <wikibugs>	 (03PS10) 10Ahmon Dancy: Added docker::gc class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707)
[18:57:04] <icinga-wm>	 PROBLEM - Host wcqs1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:58:00] <icinga-wm>	 PROBLEM - Host db1141.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:58:22] <icinga-wm>	 PROBLEM - Host mw1414.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:58:35] <icinga-wm>	 PROBLEM - Host cloudservices1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:58:47] <wikibugs>	 (03PS2) 10Andrew Bogott: mark_tool: Remove reliance of /etc/ldap.conf or /etc/ldap/ldap.conf [puppet] - 10https://gerrit.wikimedia.org/r/739326 (https://phabricator.wikimedia.org/T170355)
[18:59:24] <icinga-wm>	 PROBLEM - Host maps1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:59:40] <icinga-wm>	 PROBLEM - Host analytics1059.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:00:04] <jouncebot>	 Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211116T1900)
[19:00:24] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[19:01:02] <cmjohnson1>	 !log moving mgmt cables from old msw to new msw in a4-eqiad
[19:01:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[19:01:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:08] <icinga-wm>	 RECOVERY - Host wcqs1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms
[19:02:26] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] mark_tool: Remove reliance of /etc/ldap.conf or /etc/ldap/ldap.conf [puppet] - 10https://gerrit.wikimedia.org/r/739326 (https://phabricator.wikimedia.org/T170355) (owner: 10Andrew Bogott)
[19:03:18] <icinga-wm>	 RECOVERY - Host db1141.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms
[19:03:22] <icinga-wm>	 PROBLEM - Host ps1-a4-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[19:03:34] <icinga-wm>	 PROBLEM - Host labstore1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:03:38] <icinga-wm>	 PROBLEM - Host contint1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:03:42] <icinga-wm>	 RECOVERY - Host mw1414.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.84 ms
[19:03:46] <icinga-wm>	 PROBLEM - Host clouddb1014.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:03:48] <icinga-wm>	 PROBLEM - Host cloudelastic1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:03:52] <icinga-wm>	 PROBLEM - Host ganeti1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:03:52] <icinga-wm>	 RECOVERY - Host cloudservices1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms
[19:04:24] <icinga-wm>	 PROBLEM - Host ms-be1046.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:04:36] <icinga-wm>	 RECOVERY - Host maps1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.01 ms
[19:04:38] <icinga-wm>	 PROBLEM - Host netmon1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:04:56] <icinga-wm>	 RECOVERY - Host analytics1059.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms
[19:05:32] <icinga-wm>	 RECOVERY - Host ps1-a4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.27 ms
[19:06:44] <cmjohnson1>	 !log moving mgmt cables from old msw to new msw in a5-eqiad
[19:06:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:48] <icinga-wm>	 RECOVERY - Host netmon1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms
[19:08:08] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[19:08:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:06] <icinga-wm>	 RECOVERY - Host labstore1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.36 ms
[19:09:12] <icinga-wm>	 RECOVERY - Host contint1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.14 ms
[19:09:22] <icinga-wm>	 RECOVERY - Host clouddb1014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms
[19:09:24] <icinga-wm>	 RECOVERY - Host cloudelastic1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms
[19:09:30] <icinga-wm>	 RECOVERY - Host ganeti1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms
[19:10:02] <icinga-wm>	 RECOVERY - Host ms-be1046.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.31 ms
[19:10:25] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti6003.drmrs.wmnet with OS bullseye
[19:10:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[19:10:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:38] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host ganeti6003.drmrs.wmnet with OS bullseye
[19:11:03] <cmjohnson1>	 !log moving mgmt cables from old msw to new msw in a7-eqiad
[19:11:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:10] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti6002.drmrs.wmnet with OS bullseye
[19:11:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:14] <logmsgbot>	 !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti6004.drmrs.wmnet with OS bullseye
[19:11:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:19] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host ganeti6002.drmrs.wmnet with OS bullseye
[19:11:23] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host ganeti6004.drmrs.wmnet with OS bullseye completed: - ganeti6004 (**...
[19:11:31] <logmsgbot>	 !log jhuneidi@deploy1002 Finished scap: testwikis wikis to 1.38.0-wmf.9  refs T293950 (duration: 36m 32s)
[19:11:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:35] <stashbot>	 T293950: 1.38.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T293950
[19:13:24] <icinga-wm>	 PROBLEM - Host asw2-a-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[19:13:54] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 on db1171 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 813.51 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:14:11] <wikibugs>	 (03PS11) 10Ahmon Dancy: Added docker::gc class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707)
[19:14:18] <logmsgbot>	 !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti6001.drmrs.wmnet with OS bullseye
[19:14:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:29] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host ganeti6001.drmrs.wmnet with OS bullseye completed: - ganeti6001 (**...
[19:14:33] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:14:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:43] <wikibugs>	 (03PS1) 10Andrew Bogott: disable_tool: add ldap uri to the config file [puppet] - 10https://gerrit.wikimedia.org/r/739331 (https://phabricator.wikimedia.org/T170355)
[19:14:44] <icinga-wm>	 RECOVERY - Host asw2-a-eqiad is UP: PING OK - Packet loss = 0%, RTA = 11.38 ms
[19:15:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: drmrs: primary software task - https://phabricator.wikimedia.org/T282788 (10Majavah)
[19:15:08] <logmsgbot>	 !log jhuneidi@deploy1002 Pruned MediaWiki: 1.38.0-wmf.6 (duration: 03m 17s)
[19:15:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:29] <wikibugs>	 (03PS2) 10Herron: role::elasticsearch::cloudelastic: ship ES logs via gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/739324 (https://phabricator.wikimedia.org/T288620)
[19:15:33] <wikibugs>	 (03PS12) 10Ahmon Dancy: Added docker::gc class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707)
[19:16:04] <logmsgbot>	 !log joal@deploy1002 Started deploy [analytics/refinery@194b11b]: Regular analytics weekly train [analytics/refinery@194b11b]
[19:16:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:14] <icinga-wm>	 PROBLEM - Host mw1448.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:16:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Added docker::gc class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy)
[19:17:41] <wikibugs>	 (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/739324 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron)
[19:18:17] <cmjohnson1>	 !log moving mgmt cables from old msw to new msw in b1-eqiad
[19:18:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:47] <wikibugs>	 (03PS13) 10Ahmon Dancy: Added docker::gc class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707)
[19:18:54] <icinga-wm>	 PROBLEM - Host asw2-a-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[19:19:33] <majavah>	 asw?
[19:19:42] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[19:20:35] <Amir1>	 majavah: if you're asking what it means, access switch
[19:21:11] <icinga-wm>	 PROBLEM - Host kubestage1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:21:30] <majavah>	 Amir1: I know, I'm just wondering why that went down since !logs are about msw's and that's a different row than what was just then being worked on
[19:21:40] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] disable_tool: add ldap uri to the config file [puppet] - 10https://gerrit.wikimedia.org/r/739331 (https://phabricator.wikimedia.org/T170355) (owner: 10Andrew Bogott)
[19:22:04] <Amir1>	 I mean it shouldn't even alert if it's downtimed
[19:22:08] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10Papaul)
[19:22:09] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[19:22:30] <Amir1>	 it's a different row
[19:22:35] <icinga-wm>	 RECOVERY - Host kubestage1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.22 ms
[19:23:56] <icinga-wm>	 RECOVERY - Host asw2-a-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms
[19:23:57] <wikibugs>	 (03CR) 10Awight: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/739334 (https://phabricator.wikimedia.org/T295781) (owner: 10Awight)
[19:24:17] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: (Need By: TBD) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10RobH)
[19:24:31] <icinga-wm>	 PROBLEM - Host wcqs1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:24:43] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+1] Added docker::gc class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy)
[19:26:23] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Kubernetes: Q2:(Need By: TBD) rack/setup/install kubernetes2018 - https://phabricator.wikimedia.org/T294299 (10Papaul)
[19:27:55] <cmjohnson1>	 !log moving mgmt cables from old msw to new msw in b2-eqiad
[19:27:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:18] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: TBD) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10RobH) p:05Medium→03High
[19:29:29] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1759.69 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:29:49] <icinga-wm>	 RECOVERY - Host wcqs1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms
[19:30:47] <icinga-wm>	 PROBLEM - Host cloudcephmon1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:31:01] <icinga-wm>	 PROBLEM - Host clouddb1015.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:31:07] <icinga-wm>	 PROBLEM - Host ms-be1058.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:31:13] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10RobH)
[19:31:16] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10RobH)
[19:33:46] <wikibugs>	 (03CR) 10Cathal Mooney: "Looks good!  I've had a good look through and stepped through the scenarios I could imagine, I think it should cover the use-case for drmr" [homer/public] - 10https://gerrit.wikimedia.org/r/738372 (https://phabricator.wikimedia.org/T283050) (owner: 10Ayounsi)
[19:34:12] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] Add drmrs switches to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/738372 (https://phabricator.wikimedia.org/T283050) (owner: 10Ayounsi)
[19:34:22] <cmjohnson1>	 !log moving mgmt cables from old msw to new msw in b3-eqiad
[19:34:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:42] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={LIST,PATCH} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[19:35:54] <icinga-wm>	 RECOVERY - Host cloudcephmon1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms
[19:36:04] <icinga-wm>	 RECOVERY - Host clouddb1015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.29 ms
[19:36:10] <icinga-wm>	 RECOVERY - Host ms-be1058.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms
[19:36:22] <icinga-wm>	 PROBLEM - Host conf1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:36:26] <icinga-wm>	 PROBLEM - Host db1104.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:36:42] <icinga-wm>	 PROBLEM - Host mw1429.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:37:10] <icinga-wm>	 PROBLEM - Host mw1428.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:37:10] <icinga-wm>	 PROBLEM - Host mw1430.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:37:10] <icinga-wm>	 PROBLEM - Host mw1431.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:37:10] <icinga-wm>	 PROBLEM - Host mw1432.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:37:10] <icinga-wm>	 PROBLEM - Host mw1433.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:38:18] <logmsgbot>	 !log joal@deploy1002 Finished deploy [analytics/refinery@194b11b]: Regular analytics weekly train [analytics/refinery@194b11b] (duration: 22m 14s)
[19:38:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:42] <icinga-wm>	 RECOVERY - Host mw1448.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[19:39:46] <logmsgbot>	 !log joal@deploy1002 Started deploy [analytics/refinery@194b11b] (thin): Regular analytics weekly train THIN [analytics/refinery@194b11b]
[19:39:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:53] <logmsgbot>	 !log joal@deploy1002 Finished deploy [analytics/refinery@194b11b] (thin): Regular analytics weekly train THIN [analytics/refinery@194b11b] (duration: 00m 07s)
[19:39:54] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[19:39:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:00] <icinga-wm>	 RECOVERY - Host mw1430.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms
[19:40:04] <logmsgbot>	 !log joal@deploy1002 Started deploy [analytics/refinery@194b11b] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@194b11b]
[19:40:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:34] <icinga-wm>	 PROBLEM - Host moss-be1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:41:36] <icinga-wm>	 RECOVERY - Host conf1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[19:41:38] <icinga-wm>	 PROBLEM - Host copernicium.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:41:38] <icinga-wm>	 RECOVERY - Host db1104.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms
[19:41:54] <icinga-wm>	 RECOVERY - Host mw1429.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.05 ms
[19:42:26] <icinga-wm>	 RECOVERY - Host mw1428.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms
[19:42:26] <icinga-wm>	 RECOVERY - Host mw1431.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms
[19:42:26] <icinga-wm>	 RECOVERY - Host mw1433.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms
[19:42:26] <icinga-wm>	 RECOVERY - Host mw1432.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms
[19:42:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[19:43:13] <cmjohnson1>	 !log moving mgmt cables from old msw to new msw in b5-eqiad
[19:43:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:36] <icinga-wm>	 RECOVERY - Host copernicium.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms
[19:45:17] <wikibugs>	 (03PS1) 10Ppchelko: Demo: load a config variable from JSON file in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739336
[19:45:22] <icinga-wm>	 PROBLEM - Host db1164.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:45:24] <icinga-wm>	 PROBLEM - Host mw1395.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:45:24] <icinga-wm>	 PROBLEM - Host mw1397.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:45:30] <icinga-wm>	 PROBLEM - Host db1179.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:45:44] <icinga-wm>	 PROBLEM - Host restbase1029.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:45:50] <icinga-wm>	 PROBLEM - Host wdqs1012.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:46:00] <icinga-wm>	 RECOVERY - Host moss-be1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.82 ms
[19:46:02] <wikibugs>	 (03PS3) 10Legoktm: httpbb: Add some tests for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/739025 (https://phabricator.wikimedia.org/T285477)
[19:46:38] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={list,listWithCount} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[19:46:57] <logmsgbot>	 !log joal@deploy1002 Finished deploy [analytics/refinery@194b11b] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@194b11b] (duration: 06m 53s)
[19:46:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:30] <wikibugs>	 (03CR) 10Legoktm: [V: 03+1] "legoktm@cumin1001:~$ httpbb --hosts thumbor1001.eqiad.wmnet --http_port 8800 ~/test_thumbor.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/739025 (https://phabricator.wikimedia.org/T285477) (owner: 10Legoktm)
[19:47:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[19:48:10] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[19:49:18] <icinga-wm>	 PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:50:38] <wikibugs>	 (03PS2) 10Ppchelko: Demo: load a config variable from JSON file in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739336
[19:51:25] <logmsgbot>	 !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti6003.drmrs.wmnet with OS bullseye
[19:51:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:35] <logmsgbot>	 !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti6002.drmrs.wmnet with OS bullseye
[19:51:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:40] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host ganeti6003.drmrs.wmnet with OS bullseye completed: - ganeti6003 (**...
[19:51:43] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host ganeti6002.drmrs.wmnet with OS bullseye completed: - ganeti6002 (**...
[19:52:06] <icinga-wm>	 RECOVERY - Host db1164.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.27 ms
[19:52:06] <icinga-wm>	 RECOVERY - Host db1179.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms
[19:52:34] <icinga-wm>	 RECOVERY - Host mw1395.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms
[19:52:34] <icinga-wm>	 RECOVERY - Host mw1397.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.25 ms
[19:52:36] <icinga-wm>	 RECOVERY - Host restbase1029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms
[19:52:36] <icinga-wm>	 RECOVERY - Host wdqs1012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms
[19:52:45] <cmjohnson1>	 !log moving mgmt cables from old msw to new msw in b7-eqiad
[19:52:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:42] <icinga-wm>	 RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms
[19:55:10] <icinga-wm>	 PROBLEM - Host dbprov1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:55:20] <icinga-wm>	 PROBLEM - Host cloudcephmon1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:55:20] <icinga-wm>	 PROBLEM - Host cloudcephosd1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:55:34] <icinga-wm>	 PROBLEM - Host kafka-main1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:57:03] <wikibugs>	 (03CR) 10AOkoth: [C: 03+1] Added docker::gc class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy)
[19:58:52] <icinga-wm>	 PROBLEM - Host mw1401.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:58:52] <icinga-wm>	 PROBLEM - Host mw1399.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:58:52] <icinga-wm>	 PROBLEM - Host mw1402.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:59:22] <icinga-wm>	 RECOVERY - Host mw1399.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms
[20:00:04] <jouncebot>	 jeena and dduvall: (Dis)respected human, time to deploy MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211116T2000). Please do the needful.
[20:00:34] <icinga-wm>	 RECOVERY - Host dbprov1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms
[20:00:46] <icinga-wm>	 RECOVERY - Host cloudcephosd1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms
[20:00:46] <icinga-wm>	 RECOVERY - Host cloudcephmon1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms
[20:01:00] <icinga-wm>	 RECOVERY - Host kafka-main1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms
[20:03:29] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Wikibase Release Strategy, 10Wikidata, 10wdwb-tech: Requesting access to releasers-wikibase for rosalie-WMDE - https://phabricator.wikimedia.org/T295765 (10WMDE-leszek)
[20:03:46] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Wikibase Release Strategy, 10Wikidata, 10wdwb-tech: Requesting access to releasers-wikibase for rosalie-WMDE - https://phabricator.wikimedia.org/T295765 (10WMDE-leszek) Thanks @thcipriani. I conclude that WMF approval is not required then.
[20:03:54] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: cloudcephmon1002, stat1005, cloudcephmon1003, cloudcephmon1001, stat1008 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[20:03:58] <jeena>	 I didn't realize I had left this channel so I can't see the backscroll. I am going to deploy the train now. If there was anything that should hold it up please advise
[20:04:24] <icinga-wm>	 RECOVERY - Host mw1401.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms
[20:04:24] <icinga-wm>	 RECOVERY - Host mw1402.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms
[20:04:35] <mutante>	 jeena: there are some alerts but it's only maintenance on mgmt, seems clear for the train
[20:04:51] <jeena>	 thanks mutante 
[20:04:53] <mutante>	 (as long as it stays .mgmt)
[20:07:34] <wikibugs>	 (03PS1) 10Jeena Huneidi: group0 wikis to 1.38.0-wmf.9  refs T293950 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739338
[20:07:36] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.38.0-wmf.9  refs T293950 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739338 (owner: 10Jeena Huneidi)
[20:08:17] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.9  refs T293950 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739338 (owner: 10Jeena Huneidi)
[20:09:26] <logmsgbot>	 !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.9  refs T293950
[20:09:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:09:30] <stashbot>	 T293950: 1.38.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T293950
[20:10:22] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+1] "[cumin1001:~] $ httpbb --hosts thumbor1001.eqiad.wmnet --http_port 8800 /home/legoktm/test_thumbor.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/739025 (https://phabricator.wikimedia.org/T285477) (owner: 10Legoktm)
[20:11:16] <wikibugs>	 (03PS3) 10Dzahn: mediawiki/parsoid/wikitech: flip default for font install [puppet] - 10https://gerrit.wikimedia.org/r/739012 (https://phabricator.wikimedia.org/T294378)
[20:13:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:13:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:17:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:44] <icinga-wm>	 PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:17:58] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond)
[20:19:46] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/32452/" [puppet] - 10https://gerrit.wikimedia.org/r/739012 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn)
[20:21:00] <wikibugs>	 (03CR) 10Dzahn: "compiles noop everywhere, just switching the default value to "false" now and removing Hiera lines" [puppet] - 10https://gerrit.wikimedia.org/r/739012 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn)
[20:28:08] <wikibugs>	 (03PS2) 10Dzahn: wikimania_scholarships: let the module start to remove itself [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037)
[20:30:21] <wikibugs>	 (03CR) 10Volans: "Replies inline" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 (owner: 10Jbond)
[20:33:04] <wikibugs>	 (03CR) 10Dzahn: "ahaha, puppet duplicate declaration that is USEFUL - it tells us what else uses php-mysql here so we can't remove that. Duplicate declarat" [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn)
[20:33:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:33:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:34] <wikibugs>	 (03PS3) 10Dzahn: wikimania_scholarships: let the module start to remove itself [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037)
[20:36:20] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists, 10observability: Improve mailman3 queue alerting - https://phabricator.wikimedia.org/T295805 (10Volans) As a quick fix you could tweak the `check_interval`, `max_check_attempts` and `retry_interval` Icinga parameters that are exposed in `nrpe::monitor_service` as `check_inte...
[20:38:10] <wikibugs>	 (03CR) 10Dzahn: "better, but removing the entire scap deploy service, is it used by other sites?  https://puppet-compiler.wmflabs.org/compiler1003/32456/mi" [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn)
[20:39:07] <dcausse>	 !log restarting blazegraph on wdqs1005 (jvm stuck)
[20:39:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:13] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "absenting one specific scap::target on a server with multiple scap targets would not just remove one target but break them all because it " [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn)
[20:42:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:42:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:57] <wikibugs>	 (03PS4) 10Dzahn: wikimania_scholarships: let the module start to remove itself [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037)
[20:45:20] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "removing (commenting out) but not absenting is the way to go for removing scap::targets https://puppet-compiler.wmflabs.org/compiler1002/32" [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn)
[20:45:27] <wikibugs>	 (03PS5) 10Dzahn: wikimania_scholarships: let the module start to remove itself [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037)
[20:46:03] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/32457/" [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn)
[20:49:55] <wikibugs>	 (03CR) 10Dzahn: "Motd/File[/etc/update-motd.d/05-role-wikimania-scholarships]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/737982 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn)
[20:51:09] <mutante>	 !log [miscweb2002:/var/cache] $ sudo rm -rf scholarships/
[20:51:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:27] <wikibugs>	 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10netops: ulsfo: (2) mx80s to become temp cr[34]-drmrs - https://phabricator.wikimedia.org/T295819 (10RobH)
[20:56:05] <wikibugs>	 (03PS1) 10Dzahn: httpbb/miscweb: drop tests for scholarships.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/739347 (https://phabricator.wikimedia.org/T243037)
[20:56:34] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] httpbb/miscweb: drop tests for scholarships.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/739347 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn)
[20:58:35] <wikibugs>	 (03PS2) 10Dzahn: httpbb/miscweb: drop tests for scholarships.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/739347 (https://phabricator.wikimedia.org/T243037)
[20:59:36] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:02:17] <wikibugs>	 (03CR) 10Dzahn: "[cumin1001:~] $ httpbb /srv/deployment/httpbb-tests/miscweb/test_miscweb* --hosts miscweb1002.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/739347 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn)
[21:06:37] <wikibugs>	 (03PS1) 10Herron: mailman3_queue_size: increase check intervals [puppet] - 10https://gerrit.wikimedia.org/r/739351 (https://phabricator.wikimedia.org/T295805)
[21:18:44] <icinga-wm>	 RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:21:28] <wikibugs>	 (03PS1) 10Dzahn: acme_chief: convert cron to restart service to timer [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673)
[21:23:09] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] acme_chief: convert cron to restart service to timer [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[21:24:05] <wikibugs>	 (03PS2) 10Dzahn: acme_chief: convert cron to restart service to timer [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673)
[21:24:45] <wikibugs>	 (03PS3) 10Dzahn: acme_chief: convert cron to restart service to timer [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673)
[21:24:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] acme_chief: convert cron to restart service to timer [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[21:27:42] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/32458/" [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[21:32:30] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 on db1171 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:40:18] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10RobH)
[21:43:04] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10RobH)
[21:53:49] <wikibugs>	 (03CR) 10Juan90264: [C: 03+1] Disable local file upload on the Chinese Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738550 (https://phabricator.wikimedia.org/T295265) (owner: 104nn1l2)
[21:58:32] <icinga-wm>	 PROBLEM - DNS on mw1448.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.26 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:10:56] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/739325 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron)
[22:11:17] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] role::elasticsearch::cloudelastic: ship ES logs via gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/739324 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron)
[22:14:38] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install ml-train100[1-4] - https://phabricator.wikimedia.org/T291579 (10Jclark-ctr) @elukey  if you can update names when you get a chance Thanks   Host Racked waiting on cabling in case something changes  ml-train1001 a2...
[22:15:51] <wikibugs>	 10SRE, 10Analytics, 10LDAP-Access-Requests: LDAP access to the wmf group for Brooke Camarda & Olga Spingou (superset, turnilo, hue) - https://phabricator.wikimedia.org/T295828 (10CGlenn)
[22:19:29] <wikibugs>	 (03CR) 10Legoktm: [V: 03+1 C: 03+2] "Shipping, thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/739025 (https://phabricator.wikimedia.org/T285477) (owner: 10Legoktm)
[22:20:26] <wikibugs>	 (03PS14) 10Brennen Bearnes: Added docker::gc class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy)
[22:35:38] <wikibugs>	 10SRE, 10MediaWiki-Categories, 10Russian-Sites, 10Serbian-Sites: Broken sorting and multi-page categories for Cyrillic wikis - https://phabricator.wikimedia.org/T136281 (10FriedrickMILBarbarossa)
[22:39:04] <wikibugs>	 (03PS1) 10Legoktm: thumbor: Add thumbor1005 [puppet] - 10https://gerrit.wikimedia.org/r/739361 (https://phabricator.wikimedia.org/T285477)
[22:39:06] <wikibugs>	 (03PS1) 10Legoktm: conftool: Add thumbor1005 [puppet] - 10https://gerrit.wikimedia.org/r/739362 (https://phabricator.wikimedia.org/T285477)
[22:42:01] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] thumbor: Add thumbor1005 [puppet] - 10https://gerrit.wikimedia.org/r/739361 (https://phabricator.wikimedia.org/T285477) (owner: 10Legoktm)
[22:42:33] <wikibugs>	 (03CR) 10Dzahn: Added docker::gc class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy)
[22:43:18] <wikibugs>	 (03CR) 10Dzahn: Added docker::gc class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy)
[22:43:22] <wikibugs>	 (03CR) 10Ahmon Dancy: Added docker::gc class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy)
[22:47:31] <mutante>	 dancy: so since gitlab-runners project has its own puppetmaster, does it mean changes like this can be applied there before we merge in prod?
[22:47:48] <mutante>	 was about to compile that 
[22:47:59] <dancy>	 Yes. I did do that with this commit and tested on gitlab-runner1008 
[22:48:20] <mutante>	 trying to find an instance though that the compiler already knows
[22:48:25] <mutante>	 and has facts for
[22:48:35] <mutante>	 alright, in that case.. I will just merge it :)
[22:48:46] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Added docker::gc class [puppet] - 10https://gerrit.wikimedia.org/r/739280 (https://phabricator.wikimedia.org/T295707) (owner: 10Ahmon Dancy)
[22:48:50] <dancy>	 Thanks!
[22:49:58] <mutante>	 done! do you now manually pull on the local master or does it just happen?
[22:50:25] <dancy>	 It'll happen automatically, but I can pull now to get it moving along
[22:50:46] <mutante>	 alright, cool
[22:53:35] <dancy>	 Looks like brennen has something in progress on the puppet master.  Waiting for him
[22:54:15] <mutante>	  gitlab-runners-puppetmaster-01  has Hiera: puppetmaster: gitlab-runners-puppetmaster-01.gitlab-runners.eqiad1.wikimedia.cloud
[22:54:36] <mutante>	 but that does not seem to mean it gets confused about who is its own master
[22:55:20] <mutante>	 ack, dancy, no rush
[22:55:31] <dancy>	 👍🏾
[22:58:45] <wikibugs>	 (03PS1) 10Dzahn: gitlab-runners: move profile::gitlab::runner::docker_volume: true to repo [puppet] - 10https://gerrit.wikimedia.org/r/739366
[23:00:29] <wikibugs>	 (03PS1) 10Dzahn: gitlab-runners: move puppetmaster setting to repo [puppet] - 10https://gerrit.wikimedia.org/r/739367
[23:06:56] <wikibugs>	 (03PS4) 10Ryan Kemper: elasticsearch: it's ExecStartPre, not ExecPreStart [puppet] - 10https://gerrit.wikimedia.org/r/721644 (https://phabricator.wikimedia.org/T276198)
[23:13:27] <wikibugs>	 (03PS1) 10Clare Ming: Add new icons, wordmarks, taglines for several wikis: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739370 (https://phabricator.wikimedia.org/T290091)
[23:14:01] <wikibugs>	 10SRE, 10Traffic: Image requests sending neither "Last-Modified" nor "ETag" HTTP headers. - https://phabricator.wikimedia.org/T295556 (10Ade56facc) OK, I have seen again responses from server Thumbor without headers named in bug title.  I have reloaded web page a few times using key F5 in Chrome browser (which...
[23:14:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add new icons, wordmarks, taglines for several wikis: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739370 (https://phabricator.wikimedia.org/T290091) (owner: 10Clare Ming)
[23:17:31] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] elasticsearch: it's ExecStartPre, not ExecPreStart [puppet] - 10https://gerrit.wikimedia.org/r/721644 (https://phabricator.wikimedia.org/T276198) (owner: 10Ryan Kemper)
[23:18:00] <wikibugs>	 (03PS1) 10Dzahn: admin: add Julia Kieserman to ldap_only section [puppet] - 10https://gerrit.wikimedia.org/r/739371 (https://phabricator.wikimedia.org/T295693)
[23:18:10] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] conftool: Add thumbor1005 [puppet] - 10https://gerrit.wikimedia.org/r/739362 (https://phabricator.wikimedia.org/T285477) (owner: 10Legoktm)
[23:18:31] <legoktm>	 ryankemper: OK to merge your change?
[23:18:45] <ryankemper>	 legoktm: fire away
[23:19:01] <legoktm>	 {{done}}
[23:19:05] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to LDAP/WMF for JKieserman - https://phabricator.wikimedia.org/T295693 (10Dzahn) Thank you, Julia!  I uploaded a change to code review. This should continue from there shortly. Cheers, Daniel
[23:19:05] <ryankemper>	 ty
[23:19:15] <wikibugs>	 (03PS2) 10Clare Ming: Add new icons, wordmarks, taglines for several wikis: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739370 (https://phabricator.wikimedia.org/T290091)
[23:19:15] <ryankemper>	 !log T276198 `ryankemper@cumin1001:~$ sudo cumin '*elastic*' 'sudo disable-puppet "Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/721644"'` (done a few mins ago)
[23:19:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:19:19] <stashbot>	 T276198: /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198
[23:20:46] <wikibugs>	 (03PS2) 10Dzahn: admin: add Julia Kieserman to ldap_only section [puppet] - 10https://gerrit.wikimedia.org/r/739371 (https://phabricator.wikimedia.org/T295693)
[23:21:11] <logmsgbot>	 !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=thumbor1005.eqiad.wmnet
[23:21:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:13] <logmsgbot>	 !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=thumbor1005.eqiad.wmnet
[23:22:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:32] <logmsgbot>	 !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=thumbor1005.eqiad.wmnet
[23:22:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:23:06] <legoktm>	 I pooled it for a minute but depooled because it seemed to be returning 404s for everything
[23:23:10] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "[mwmaint1002:~] $  ldapsearch -x uid=jkieserman" [puppet] - 10https://gerrit.wikimedia.org/r/739371 (https://phabricator.wikimedia.org/T295693) (owner: 10Dzahn)
[23:25:44] <legoktm>	 checking the 404s they all seem legit
[23:25:51] <logmsgbot>	 !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=thumbor1005.eqiad.wmnet
[23:25:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:17] <logmsgbot>	 !log legoktm@cumin1001 conftool action : set/weight=5; selector: name=thumbor1005.eqiad.wmnet
[23:27:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:51] <ryankemper>	 !log T276198 `ryankemper@elastic1049:~$ sudo run-puppet-agent --force`;  `elasticsearch_6@production-search-eqiad.service` didn't restart but it looks like there might be something slightly wrong with the new `ExecStartPre` line => `Executable path is not absolute, ignoring: systemd-tmpfiles --create /usr/lib/tmpfiles.d/elasticsearch.conf`
[23:27:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:54] <stashbot>	 T276198: /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198
[23:37:57] <wikibugs>	 (03PS1) 10Legoktm: Move thumbor1006 to thumbor::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/739374
[23:40:03] <wikibugs>	 (03PS2) 10Legoktm: Move thumbor1006 to thumbor::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/739374 (https://phabricator.wikimedia.org/T285477)
[23:42:11] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] Move thumbor1006 to thumbor::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/739374 (https://phabricator.wikimedia.org/T285477) (owner: 10Legoktm)
[23:42:53] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] cassandra: move cluster:user relation from 1:1 relation to a 1:many [puppet] - 10https://gerrit.wikimedia.org/r/738270 (https://phabricator.wikimedia.org/T235299) (owner: 10Hnowlan)
[23:43:27] <wikibugs>	 (03PS1) 10Ryan Kemper: elasticsearch: use absolute path of binary [puppet] - 10https://gerrit.wikimedia.org/r/739375 (https://phabricator.wikimedia.org/T276198)
[23:43:45] <wikibugs>	 (03PS2) 10Ryan Kemper: elasticsearch: use absolute path of binary [puppet] - 10https://gerrit.wikimedia.org/r/739375 (https://phabricator.wikimedia.org/T276198)
[23:44:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: use absolute path of binary [puppet] - 10https://gerrit.wikimedia.org/r/739375 (https://phabricator.wikimedia.org/T276198) (owner: 10Ryan Kemper)
[23:46:52] <wikibugs>	 (03PS3) 10Ryan Kemper: elasticsearch: use absolute path of binary [puppet] - 10https://gerrit.wikimedia.org/r/739375 (https://phabricator.wikimedia.org/T276198)
[23:49:23] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/739375 (https://phabricator.wikimedia.org/T276198) (owner: 10Ryan Kemper)
[23:55:53] <wikibugs>	 (03PS4) 10Ryan Kemper: elasticsearch: use absolute path of binary [puppet] - 10https://gerrit.wikimedia.org/r/739375 (https://phabricator.wikimedia.org/T276198)
[23:57:23] <wikibugs>	 (03CR) 10Legoktm: [C: 03+1] elasticsearch: use absolute path of binary [puppet] - 10https://gerrit.wikimedia.org/r/739375 (https://phabricator.wikimedia.org/T276198) (owner: 10Ryan Kemper)
[23:58:19] <icinga-wm>	 PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[23:58:43] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] elasticsearch: use absolute path of binary [puppet] - 10https://gerrit.wikimedia.org/r/739375 (https://phabricator.wikimedia.org/T276198) (owner: 10Ryan Kemper)
[23:59:38] <ryankemper>	 !log T276198 `ryankemper@elastic1049:~$ sudo run-puppet-agent --force` to test out https://gerrit.wikimedia.org/r/c/operations/puppet/+/739375
[23:59:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:59:42] <stashbot>	 T276198: /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198