[00:03:20] <icinga-wm>	 RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:25:22] <icinga-wm>	 PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:59:54] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1006-cloudelastic-chi-eqiad on cloudelastic1006 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1006&panelId=37
[01:24:38] <wikibugs>	 (CR) Legoktm: "> Patch Set 2:" (1 comment) [software/mailman-templates] - https://gerrit.wikimedia.org/r/685533 (owner: Legoktm)
[01:24:42] <wikibugs>	 (PS3) Legoktm: Add qqq [software/mailman-templates] - https://gerrit.wikimedia.org/r/685533
[01:24:44] <wikibugs>	 (PS1) Legoktm: Don't translate blank templates [software/mailman-templates] - https://gerrit.wikimedia.org/r/699981
[01:24:46] <wikibugs>	 (PS1) Legoktm: Use one variable syntax, remove tabs [software/mailman-templates] - https://gerrit.wikimedia.org/r/699982
[01:26:08] <icinga-wm>	 RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:43:44] <wikibugs>	 (CR) Legoktm: [V: +2 C: +2] Don't translate blank templates [software/mailman-templates] - https://gerrit.wikimedia.org/r/699981 (owner: Legoktm)
[01:44:11] <wikibugs>	 (CR) Legoktm: [V: +2 C: +2] Use one variable syntax, remove tabs [software/mailman-templates] - https://gerrit.wikimedia.org/r/699982 (owner: Legoktm)
[01:44:52] <wikibugs>	 SRE, Wikimedia-Mailing-lists, translatewiki.net, Patch-For-Review: Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (Legoktm) Pushed some new commits, I think I've addressed all the feedback so far.
[03:26:04] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is OK: (C)100 gt (W)80 gt 73.22 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1005&panelId=37
[04:59:22] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1006-cloudelastic-chi-eqiad on cloudelastic1006 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1006&panelId=37
[05:04:27] <wikibugs>	 (PS1) Marostegui: dbproxy1019: Depool clouddb1014. [puppet] - https://gerrit.wikimedia.org/r/699987
[05:05:36] <wikibugs>	 (CR) Marostegui: [C: +2] dbproxy1019: Depool clouddb1014. [puppet] - https://gerrit.wikimedia.org/r/699987 (owner: Marostegui)
[05:06:56] <marostegui>	 !log Upgrade clouddb1014
[05:06:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:48] <wikibugs>	 (PS1) Marostegui: Revert "dbproxy1019: Depool clouddb1014." [puppet] - https://gerrit.wikimedia.org/r/699864
[05:09:34] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "dbproxy1019: Depool clouddb1014." [puppet] - https://gerrit.wikimedia.org/r/699864 (owner: Marostegui)
[06:43:44] <icinga-wm>	 RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:52:11] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[06:52:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:48] <wikibugs>	 (PS1) QChris: Add .gitreview [debs/istioctl] - https://gerrit.wikimedia.org/r/700011
[06:53:50] <wikibugs>	 (CR) QChris: [V: +2 C: +2] Add .gitreview [debs/istioctl] - https://gerrit.wikimedia.org/r/700011 (owner: QChris)
[06:55:50] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:55:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210616T0700)
[07:07:27] <elukey>	 qchris: o/ thanks a lot
[07:07:43] <qchris>	 yw :)
[07:09:29] <wikibugs>	 (PS1) Elukey: Add initial debianization for istioctl 1.6.14 [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192)
[07:09:39] <wikibugs>	 (PS5) Ema: varnish: add timing data to varnishmtail [puppet] - https://gerrit.wikimedia.org/r/699223 (https://phabricator.wikimedia.org/T284576)
[07:09:41] <wikibugs>	 (PS2) Ema: varnish: add prometheus histogram varnish_processing_seconds [puppet] - https://gerrit.wikimedia.org/r/699941 (https://phabricator.wikimedia.org/T284576)
[07:10:21] <wikibugs>	 (PS2) Elukey: Add initial debianization for istioctl 1.6.14 [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192)
[07:12:58] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[07:14:33] <wikibugs>	 (PS3) Elukey: Add initial debianization for istioctl 1.6.14 [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192)
[07:27:26] <dcausse>	 !log cleanup old /var/log/airflow/scheduler logs to reclaim space on an-airflow1001
[07:27:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:49] <wikibugs>	 (CR) Muehlenhoff: "Looks good, a few nits inline." (5 comments) [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[07:35:13] <wikibugs>	 SRE, Traffic, netops, Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (cmooney) Open→Resolved
[07:45:00] <wikibugs>	 (PS4) Elukey: Add initial debianization for istioctl 1.6.14 [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192)
[07:45:33] <wikibugs>	 (CR) Elukey: "Thanks a lot for the review! Tried to fix all the comments, hope it is better now :)" (5 comments) [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[07:48:50] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Ship it :-)" [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[07:59:51] <wikibugs>	 (CR) JMeybohm: [C: -1] "I think you'll want to append the istio version (or at least major and minor) to the debian package name as well as the filename of the bi" [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[08:03:30] <wikibugs>	 (PS1) Jcrespo: Revert "mariadb: Switchover s7&s8 codfw backups from db2100 to db2098" [puppet] - https://gerrit.wikimedia.org/r/699865 (https://phabricator.wikimedia.org/T284980)
[08:03:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] Revert "mariadb: Switchover s7&s8 codfw backups from db2100 to db2098" [puppet] - https://gerrit.wikimedia.org/r/699865 (https://phabricator.wikimedia.org/T284980) (owner: Jcrespo)
[08:06:25] <wikibugs>	 (PS2) Jcrespo: Revert "mariadb: Switchover s7&s8 codfw backups from db2100 to db2098" [puppet] - https://gerrit.wikimedia.org/r/699865 (https://phabricator.wikimedia.org/T284980)
[08:09:42] <wikibugs>	 (PS3) Jcrespo: Revert "mariadb: Switchover s7&s8 codfw backups from db2100 to db2098" [puppet] - https://gerrit.wikimedia.org/r/699865 (https://phabricator.wikimedia.org/T284980)
[08:09:58] <icinga-wm>	 PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:15:10] <wikibugs>	 (CR) Jcrespo: [C: +2] Revert "mariadb: Switchover s7&s8 codfw backups from db2100 to db2098" [puppet] - https://gerrit.wikimedia.org/r/699865 (https://phabricator.wikimedia.org/T284980) (owner: Jcrespo)
[08:19:15] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM overall, just an observation inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/699941 (https://phabricator.wikimedia.org/T284576) (owner: Ema)
[08:24:19] <Amir1>	 !log running "update flaggedrevs set fr_quality = 0 where fr_quality != 0;" on all wikis where flagged revs is enabled (T279761)
[08:24:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:23] <stashbot>	 T279761: When reviewing pending changes, raw message ID "⧼revreview-hist-quality⧽" shown instead of human readable string - https://phabricator.wikimedia.org/T279761
[08:28:08] <wikibugs>	 (PS1) Jbond: gitlab: add ability to only listen on port 80 [puppet] - https://gerrit.wikimedia.org/r/700020
[08:29:31] <wikibugs>	 (CR) Jbond: [C: +2] gitlab: add ability to only listen on port 80 [puppet] - https://gerrit.wikimedia.org/r/700020 (owner: Jbond)
[08:29:40] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 121 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:31:12] <jynus>	 jobrunner issues
[08:31:53] <Amir1>	 the maxlag is going up but that's sorta expected 
[08:31:54] <Amir1>	 https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=6&orgId=1
[08:32:02] <Amir1>	 it'll finish soon
[08:32:31] <Amir1>	 currently at ukwiki
[08:32:57] <Amir1>	 done
[08:33:24] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 39 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:38:04] <wikibugs>	 (PS1) Jbond: gitlab: add listen_port and listent_https to template [puppet] - https://gerrit.wikimedia.org/r/700023
[08:41:16] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] mwdebug: include nutcracker and mcrouter pools in values [deployment-charts] - https://gerrit.wikimedia.org/r/699432 (https://phabricator.wikimedia.org/T284420) (owner: Effie Mouzeli)
[08:41:26] <wikibugs>	 (CR) Jbond: [C: +2] gitlab: add listen_port and listent_https to template [puppet] - https://gerrit.wikimedia.org/r/700023 (owner: Jbond)
[08:49:57] <wikibugs>	 (CR) Elukey: "> Patch Set 4: Code-Review-1" [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[08:53:22] <wikibugs>	 (PS4) Kormat: mariadb: Automatically manage pt-heartbeat. [puppet] - https://gerrit.wikimedia.org/r/699213
[08:59:26] <wikibugs>	 (PS4) Kormat: mariadb: Promote db1157 as s3 primary [puppet] - https://gerrit.wikimedia.org/r/698981 (https://phabricator.wikimedia.org/T284648)
[09:01:58] <wikibugs>	 (CR) JMeybohm: [C: +1] "> Patch Set 4:" [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[09:03:26] <wikibugs>	 (CR) Elukey: "> Patch Set 4: Code-Review+1" [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[09:04:05] <kormat>	 !log uploaded wmfmariadbpy 0.7.1 to apt.wm.o
[09:04:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:20] <kormat>	 !log Deploying wmfmariadbpy 0.7.1 T284819
[09:04:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:23] <stashbot>	 T284819: Deploy wmfmariadbpy 0.7.1 - https://phabricator.wikimedia.org/T284819
[09:07:21] <wikibugs>	 (CR) JMeybohm: [C: +1] "> Patch Set 4:" [debs/istioctl] - https://gerrit.wikimedia.org/r/700012 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[09:10:42] <icinga-wm>	 RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:11:37] <wikibugs>	 SRE: debdeploy does not support bullseye - https://phabricator.wikimedia.org/T285034 (Kormat)
[09:21:02] <wikibugs>	 (PS2) Kormat: wmnet: Update s3-master to db1157 [dns] - https://gerrit.wikimedia.org/r/698982 (https://phabricator.wikimedia.org/T284648)
[09:23:02] <kormat>	 jouncebot: next
[09:23:02] <jouncebot>	 In 21 hour(s) and 36 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210617T0700)
[09:23:30] <wikibugs>	 (PS1) Kormat: Revert "db-eqiad.php: Set pc1010 as pc3 primary." [mediawiki-config] - https://gerrit.wikimedia.org/r/700026
[09:24:16] <wikibugs>	 (PS1) Muehlenhoff: Support bullseye [puppet] - https://gerrit.wikimedia.org/r/700046 (https://phabricator.wikimedia.org/T285034)
[09:25:52] <wikibugs>	 (PS2) Muehlenhoff: Support bullseye [puppet] - https://gerrit.wikimedia.org/r/700046 (https://phabricator.wikimedia.org/T285034)
[09:26:15] <wikibugs>	 (CR) Kormat: [C: +1] Support bullseye [puppet] - https://gerrit.wikimedia.org/r/700046 (https://phabricator.wikimedia.org/T285034) (owner: Muehlenhoff)
[09:27:35] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Support bullseye [puppet] - https://gerrit.wikimedia.org/r/700046 (https://phabricator.wikimedia.org/T285034) (owner: Muehlenhoff)
[09:28:32] <wikibugs>	 (CR) Hnowlan: [C: +1] maps: fix SQL modules paths in import script [puppet] - https://gerrit.wikimedia.org/r/695240 (owner: MSantos)
[09:28:39] <wikibugs>	 (CR) Hnowlan: [C: +2] maps: fix SQL modules paths in import script [puppet] - https://gerrit.wikimedia.org/r/695240 (owner: MSantos)
[09:28:45] <wikibugs>	 (PS1) Kormat: pc1010: Move back to pc1, now that the maintenance is done. [puppet] - https://gerrit.wikimedia.org/r/700047 (https://phabricator.wikimedia.org/T282761)
[09:29:51] <wikibugs>	 SRE, Patch-For-Review: debdeploy does not support bullseye - https://phabricator.wikimedia.org/T285034 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff Fixed. I ran puppet on cumin hosts, should work now.
[09:31:53] * apergos peeks in
[09:31:54] <jynus>	 same issue?
[09:31:58] <joe>	 i just got multipaged
[09:31:58] <volans>	 got paged
[09:32:03] <moritzm>	 expired downtime?
[09:32:08] <joe>	 possibly
[09:32:12] <volans>	 it's splunk that didn't resolve the incidents?
[09:32:23] <godog>	 mmhh yeah I suspect that's right volans 
[09:32:24] <jynus>	 ah  could be
[09:32:26] <joe>	 let me see if there is any impact
[09:32:31] <jynus>	 as there is nothing on irc
[09:33:00] <sobanski>	 I acked all of them
[09:33:03] <apergos>	 I mean I just can't believe Telia has a second outage in as many days etc.
[09:33:16] <joe>	 yup no actual issue I can see
[09:33:53] <volans>	 sobanski: we need to resolve them though
[09:34:06] <volans>	 or they will page again tomorrow AFAIK, godog correct me if I'm wrong
[09:34:11] <apergos>	 heh
[09:34:24] <apergos>	 and then we will all be here again saying "a third outage? wtf" :-D
[09:34:37] <sobanski>	 volans: sure thing, as soon as we're confident there's no actual problem :)
[09:35:42] <godog>	 volans: that's right yeah, +1 to resolve IMHO
[09:36:03] <joe>	 sobanski: I'm pretty sure it's not but let's wait for XioNoX and topranks  to confirm
[09:36:15] <topranks>	 I've not found any issue thus far.
[09:36:40] <joe>	 yeah no user-facing problem, so I'd say it's safe to resolve
[09:36:46] <topranks>	 Telia link looks ok
[09:37:33] <apergos>	 I see nada as well
[09:37:49] <wikibugs>	 (CR) Kormat: [C: +2] Revert "db-eqiad.php: Set pc1010 as pc3 primary." [mediawiki-config] - https://gerrit.wikimedia.org/r/700026 (owner: Kormat)
[09:38:24] <topranks>	 yeah I'm happy enough that eqsin has not been affected network wise
[09:38:48] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Set pc1010 as pc3 primary." [mediawiki-config] - https://gerrit.wikimedia.org/r/700026 (owner: Kormat)
[09:40:37] <logmsgbot>	 !log kormat@deploy1002 Synchronized wmf-config/db-eqiad.php: Repool pc1009 as pc3 primary T282761 (duration: 00m 59s)
[09:40:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:42] <stashbot>	 T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761
[09:44:10] <wikibugs>	 (CR) Kormat: [C: +2] pc1010: Move back to pc1, now that the maintenance is done. [puppet] - https://gerrit.wikimedia.org/r/700047 (https://phabricator.wikimedia.org/T282761) (owner: Kormat)
[09:46:20] <icinga-wm>	 PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:47:20] <kormat>	 !log truncating all pc* tables on pc1010 T282761
[09:47:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:24] <stashbot>	 T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761
[09:50:07] <hnowlan>	 !log disabling puppet on maps1* to reparent maps1007 from new master maps1009
[09:50:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:04] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on maps1007.eqiad.wmnet with reason: Reparenting from maps1009
[09:51:04] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on maps1007.eqiad.wmnet with reason: Reparenting from maps1009
[09:51:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:42] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1007.eqiad.wmnet
[09:52:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:42] <wikibugs>	 (CR) Hnowlan: [V: +1 C: +2] maps: make maps1007 a buster replica of the new imposm cluster [puppet] - https://gerrit.wikimedia.org/r/699782 (https://phabricator.wikimedia.org/T269582) (owner: Hnowlan)
[09:58:24] <volans>	 hnowlan: FYI that downtime will expire during the weekend (in case it might page)
[09:58:40] <hnowlan>	 volans: ack, I'll be undoing it or extending it before then 
[09:58:49] <volans>	 np, thx
[10:00:11] <sobanski>	 Closing the Splunk thread, I'll go and resolve them all unless anyone thinks otherwise
[10:10:41] <godog>	 sobanski: +1
[10:11:40] <sobanski>	 We probably need to follow up on why they weren't auto-resolved, or is that a known scenario?
[10:13:41] <godog>	 sobanski: sometimes it happens yeah, though we've focused more on alertmanager than improving icinga since it doesn't happen very often
[10:13:58] <sobanski>	 Got it
[10:15:01] <volans>	 godog: would it be easy to create a simple check in icinga that alerts here on IRC (no page) if there is any open incident on splunk opened more than say 6h ago?
[10:19:41] <wikibugs>	 Puppet, Packaging, User-jbond: Explore packaging facter 4.0 - https://phabricator.wikimedia.org/T285043 (jbond) p:Triage→Low
[10:21:28] <wikibugs>	 Puppet, Packaging, User-jbond: Explore packaging facter 4.0 - https://phabricator.wikimedia.org/T285043 (jbond) trying `gem2deb facter` we get the following issue   ` dpkg-checkbuilddeps: error: Unmet build dependencies: ruby-hocon (>= 1.3) ruby-thor (<< 2.0) ruby-thor (>= 1.0.1) `  As such we need t...
[10:23:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3316', diff saved to https://phabricator.wikimedia.org/P16535 and previous config saved to /var/cache/conftool/dbconfig/20210616-102349-marostegui.json
[10:23:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:31] <wikibugs>	 SRE, Okapi [Wikimedia Enterprise], Traffic: "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra - https://phabricator.wikimedia.org/T281428 (Eugene.chernov) Thank you for the change. Now I can see wikimedia.com NS records are over to AWS. https://www.nslookup.io/dns-records/wikimedia.com
[10:32:59] <godog>	 volans: probably not too complex 
[10:34:18] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on maps1007.eqiad.wmnet with reason: REIMAGE
[10:34:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16536 and previous config saved to /var/cache/conftool/dbconfig/20210616-103604-root.json
[10:36:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:25] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps1007.eqiad.wmnet with reason: REIMAGE
[10:36:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:04] <wikibugs>	 (PS2) Giuseppe Lavagetto: mediawiki::web::yaml_defs: fix data structure name [puppet] - https://gerrit.wikimedia.org/r/698209
[10:50:06] <wikibugs>	 (PS1) Giuseppe Lavagetto: mediawiki::web::yaml_defs: fix virtualhost port. [puppet] - https://gerrit.wikimedia.org/r/700050
[10:51:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 50%: Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16537 and previous config saved to /var/cache/conftool/dbconfig/20210616-105108-root.json
[10:51:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:44] <wikibugs>	 (PS2) Giuseppe Lavagetto: mediawiki::web::yaml_defs: fix virtualhost port. [puppet] - https://gerrit.wikimedia.org/r/700050
[10:52:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:54:13] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:54:23] <wikibugs>	 (CR) jerkins-bot: [V: -1] mediawiki::web::yaml_defs: fix virtualhost port. [puppet] - https://gerrit.wikimedia.org/r/700050 (owner: Giuseppe Lavagetto)
[10:56:08] <wikibugs>	 (PS3) Giuseppe Lavagetto: mediawiki::web::yaml_defs: fix virtualhost port. [puppet] - https://gerrit.wikimedia.org/r/700050
[10:57:10] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29895/console" [puppet] - https://gerrit.wikimedia.org/r/700050 (owner: Giuseppe Lavagetto)
[10:59:20] <wikibugs>	 (CR) Jelto: [C: +2] bacula/gitlab: add a backup::set for gitlab and use it [puppet] - https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[11:03:52] <wikibugs>	 (PS1) Jbond: cfssl::cert: renew certificates when there are 11 days left [puppet] - https://gerrit.wikimedia.org/r/700051
[11:05:18] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29896/console" [puppet] - https://gerrit.wikimedia.org/r/700051 (owner: Jbond)
[11:05:37] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] cfssl::cert: renew certificates when there are 11 days left [puppet] - https://gerrit.wikimedia.org/r/700051 (owner: Jbond)
[11:06:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16538 and previous config saved to /var/cache/conftool/dbconfig/20210616-110612-root.json
[11:06:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:29] <wikibugs>	 (PS1) Jbond: P:tendril::webserver: Ensure apache is automatically refreshed when certs change [puppet] - https://gerrit.wikimedia.org/r/700052
[11:15:38] <wikibugs>	 (PS1) Jbond: cfssl::cert: update to the correct amount of seconds [puppet] - https://gerrit.wikimedia.org/r/700053
[11:15:59] <wikibugs>	 (CR) Jbond: [C: +2] P:tendril::webserver: Ensure apache is automatically refreshed when certs change [puppet] - https://gerrit.wikimedia.org/r/700052 (owner: Jbond)
[11:16:14] <wikibugs>	 (CR) Vgutierrez: [C: +1] cfssl::cert: update to the correct amount of seconds [puppet] - https://gerrit.wikimedia.org/r/700053 (owner: Jbond)
[11:16:25] <wikibugs>	 (CR) Jbond: [C: +2] cfssl::cert: update to the correct amount of seconds [puppet] - https://gerrit.wikimedia.org/r/700053 (owner: Jbond)
[11:20:51] <hnowlan>	 !log running `nodetool cleanup` on maps1005 
[11:20:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16539 and previous config saved to /var/cache/conftool/dbconfig/20210616-112115-root.json
[11:21:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:27] <wikibugs>	 (CR) Effie Mouzeli: [C: +1] mediawiki::web::yaml_defs: fix virtualhost port. [puppet] - https://gerrit.wikimedia.org/r/700050 (owner: Giuseppe Lavagetto)
[11:40:15] <apergos>	 sleepless in Australia?
[12:00:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1131', diff saved to https://phabricator.wikimedia.org/P16540 and previous config saved to /var/cache/conftool/dbconfig/20210616-120015-marostegui.json
[12:00:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:26] <wikibugs>	 (CR) Jcrespo: [C: -2] "This will create duplicate backup storage as it is now, we won't have enough storage for this:" [puppet] - https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[12:00:58] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on maps1007.eqiad.wmnet with reason: Reparenting from maps1009
[12:00:59] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on maps1007.eqiad.wmnet with reason: Reparenting from maps1009
[12:01:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:41] <wikibugs>	 (CR) Jcrespo: [C: -2] "Also, the default backup policy (weekly full backups) for exports will not be a good match." [puppet] - https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[12:08:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 25%: Repool db1131 after schema change', diff saved to https://phabricator.wikimedia.org/P16541 and previous config saved to /var/cache/conftool/dbconfig/20210616-120818-root.json
[12:08:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:58] <icinga-wm>	 PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:23:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 50%: Repool db1131 after schema change', diff saved to https://phabricator.wikimedia.org/P16543 and previous config saved to /var/cache/conftool/dbconfig/20210616-122322-root.json
[12:23:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:47] <kormat>	 !log deploying heartbeat service puppet change
[12:34:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:59] <wikibugs>	 (CR) Kormat: [C: +2] mariadb: Automatically manage pt-heartbeat. [puppet] - https://gerrit.wikimedia.org/r/699213 (owner: Kormat)
[12:36:54] <icinga-wm>	 PROBLEM - Host cp5006 is DOWN: PING CRITICAL - Packet loss = 100%
[12:37:04] <icinga-wm>	 PROBLEM - Host ganeti5001 is DOWN: PING CRITICAL - Packet loss = 100%
[12:37:20] <icinga-wm>	 PROBLEM - Host ncredir5001 is DOWN: PING CRITICAL - Packet loss = 100%
[12:38:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 75%: Repool db1131 after schema change', diff saved to https://phabricator.wikimedia.org/P16544 and previous config saved to /var/cache/conftool/dbconfig/20210616-123826-root.json
[12:38:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:30] <icinga-wm>	 PROBLEM - Host ganeti5002 is DOWN: PING CRITICAL - Packet loss = 100%
[12:38:33] <icinga-wm>	 PROBLEM - Host upload-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[12:38:43] <godog>	 *siiiigh*, it's the telia link again, isn't it?
[12:38:58] * volans here
[12:38:59] <godog>	 XioNoX topranks ^ what do you think ?
[12:39:01] <XioNoX>	 yo
[12:39:05] <volans>	 let's depool
[12:39:20] <icinga-wm>	 RECOVERY - Host ncredir5001 is UP: PING WARNING - Packet loss = 90%, RTA = 225.30 ms
[12:39:22] <icinga-wm>	 RECOVERY - Host cp5006 is UP: PING OK - Packet loss = 0%, RTA = 226.46 ms
[12:39:24] <XioNoX>	 ah?
[12:39:26] <icinga-wm>	 RECOVERY - Host ganeti5002 is UP: PING OK - Packet loss = 0%, RTA = 230.65 ms
[12:39:31] <XioNoX>	 I was about to kill the telia link
[12:39:33] <wikibugs>	 (PS1) Volans: Revert "Revert "Depool eqsin"" [dns] - https://gerrit.wikimedia.org/r/700027
[12:39:40] <volans>	 I've created ^^^
[12:39:46] <icinga-wm>	 RECOVERY - Host ganeti5001 is UP: PING OK - Packet loss = 0%, RTA = 225.51 ms
[12:40:00] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Revert "Revert "Depool eqsin"" [dns] - https://gerrit.wikimedia.org/r/700027 (owner: Volans)
[12:40:01] <godog>	 volans: yeah +1 
[12:40:01] <volans>	 XioNoX: go ahead
[12:40:14] <volans>	 but you said yesterday it's better to be depooled without the link
[12:40:15] <icinga-wm>	 RECOVERY - Host upload-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 224.99 ms
[12:40:16] <volans>	 is that correct?
[12:41:32] * volans waiting for XioNoX given the recovery
[12:41:38] <topranks>	 mtr seems ok
[12:41:46] <XioNoX>	 this one I'm draining it, not killing it
[12:41:57] <topranks>	 https://www.irccloud.com/pastebin/OrHHe8sH/
[12:43:28] <XioNoX>	 alright, telia link is up but traffic is routed over the tunnel link
[12:44:42] <volans>	 XioNoX: do you think we should depool anyway? how much redundancy is left?
[12:45:06] <XioNoX>	 it's fine to keep it pooled for now
[12:45:22] <volans>	 sending most of asia to AMS or SFO ofc is not great but better than failures ;)
[12:45:35] <XioNoX>	 telia is not fully down
[12:45:51] <cdanis>	 if telia goes hard down we'll get a router interfaces down alert and can depool then
[12:45:53] <XioNoX>	 so redundancy is not great, not terrible :)
[12:46:44] <XioNoX>	 I was in the middle of an email to Telia, I guess I'll add that :)
[12:47:30] <godog>	 (I resolved the incident)
[12:49:28] <wikibugs>	 (03PS1) 10Zabe: Rename Portal and Portal talk namespaces on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700065 (https://phabricator.wikimedia.org/T284868)
[12:49:36] <XioNoX>	 email sent
[12:49:51] <volans>	 thx
[12:50:05] <volans>	 I'll leave the patch there in case it's needed
[12:50:06] <XioNoX>	 smokeping didn't pick up the connectivity issue: https://smokeping.wikimedia.org/?target=eqsin
[12:52:47] <jinxer-wm>	 (Traffic on tunnel link) firing: Traffic on tunnel link   - https://alerts.wikimedia.org
[12:53:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 100%: Repool db1131 after schema change', diff saved to https://phabricator.wikimedia.org/P16545 and previous config saved to /var/cache/conftool/dbconfig/20210616-125329-root.json
[12:53:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:39] <wikibugs>	 (03CR) 10Jelto: "> Patch Set 2: Code-Review-2" [puppet] - 10https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn)
[12:57:47] <jinxer-wm>	 (Traffic on tunnel link) firing: (2) Traffic on tunnel link   - https://alerts.wikimedia.org
[12:59:10] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 208 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:02:47] <jinxer-wm>	 (Traffic on tunnel link) resolved: Traffic on tunnel link   - https://alerts.wikimedia.org
[13:22:54] <wikibugs>	 (03CR) 10Elukey: Add support for knative serving (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey)
[13:23:01] <wikibugs>	 (03PS6) 10Elukey: Add support for knative serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194)
[13:26:43] <icinga-wm>	 PROBLEM - Check systemd state on db2151 is CRITICAL: CRITICAL - degraded: The following units failed: pt-heartbeat-wikimedia.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:28:48] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-2] "> Patch Set 2: -Code-Review" [puppet] - 10https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn)
[13:35:03] <icinga-wm>	 PROBLEM - Check systemd state on db1176 is CRITICAL: CRITICAL - degraded: The following units failed: pt-heartbeat-wikimedia.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:41:12] <marostegui>	 kormat: ^ 
[13:41:19] <marostegui>	 you still testing?
[13:41:43] <kormat>	 i am not
[13:41:56] <marostegui>	 ah those are the new hosts I think
[13:41:59] <kormat>	 what is db1176?
[13:42:02] <marostegui>	 jynus: ^ does that need pt-heartbeat?
[13:42:07] <jynus>	 mmmm
[13:42:11] <marostegui>	 those are mediabackups misc kormat 
[13:42:11] <jynus>	 what are those?
[13:42:13] <kormat>	 i'll ack, you folks figure it out :)
[13:42:25] <marostegui>	 jynus: I believe the mediabackup misc hosts :)
[13:42:27] <jynus>	 I think those are standalone
[13:42:38] <jynus>	 so I guess no pt?
[13:42:43] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on db1176 is CRITICAL: CRITICAL - degraded: The following units failed: pt-heartbeat-wikimedia.service Kormat unsure https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:43:08] <marostegui>	 jynus: profile::mariadb::mysql_role: 'master' should we remove that?
[13:43:12] <jynus>	 mmmm
[13:43:16] <jynus>	 let me check
[13:43:20] <marostegui>	 sure, no rush
[13:43:37] * marostegui goes back to his all hands session
[13:44:19] <jynus>	 maybe it just needs some grant fixes?
[13:46:10] <jynus>	 failed: Unknown database 'heartbeat'
[13:46:26] <jynus>	 so that is an easy fix
[13:50:12] <jynus>	 interesting, it exists on codfw, so what is failing there?
[13:51:12] <jynus>	 "Data too long for column 'shard' at row 1"
[13:51:14] <jynus>	 lol
[13:51:55] <jynus>	 kormat, marostegui either we make the column larger or the value ('mediabackupstemp') smaller
[13:52:11] <marostegui>	 hahaha
[13:52:17] <jynus>	 for now I think I will alter just this db
[13:52:20] <marostegui>	 It is probably easier to make the value smaller
[13:52:24] <jynus>	 it is isolated enough it won't be an issue
[13:52:43] <marostegui>	 jynus: +1 for the quick fix anyways
[13:52:45] <jynus>	 it is not mediawiki and will never serve mediawiki traffic without a full reimage
[13:59:27] <icinga-wm>	 RECOVERY - Check systemd state on db2151 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:00:55] <wikibugs>	 (03PS2) 10Zabe: Rename Portal and Portal talk namespaces on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700065 (https://phabricator.wikimedia.org/T284868)
[14:06:40] <wikibugs>	 (03CR) 10Elukey: "This is now ready for review, works on minikube." [deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey)
[14:10:37] <icinga-wm>	 RECOVERY - Check systemd state on db1176 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:14:07] <icinga-wm>	 RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:20:53] <wikibugs>	 (03PS1) 10Hnowlan: postgres::slave: remove problematic auto-replicate [puppet] - 10https://gerrit.wikimedia.org/r/700071
[14:45:31] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] Add support for knative serving (038 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey)
[14:45:44] <wikibugs>	 (03CR) 10Volans: "I agree with the idea, there is no need to have puppet manage that. Added the other people involved in postgres installations (netbox and " [puppet] - 10https://gerrit.wikimedia.org/r/700071 (owner: 10Hnowlan)
[14:56:56] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::web::yaml_defs: fix data structure name [puppet] - 10https://gerrit.wikimedia.org/r/698209 (owner: 10Giuseppe Lavagetto)
[14:58:06] <wikibugs>	 (03PS1) 10Volans: icinga: rename some IcingaHosts methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/700076
[14:58:45] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::web::yaml_defs: fix virtualhost port. [puppet] - 10https://gerrit.wikimedia.org/r/700050 (owner: 10Giuseppe Lavagetto)
[14:58:57] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: mediawiki::web::yaml_defs: fix virtualhost port. [puppet] - 10https://gerrit.wikimedia.org/r/700050
[15:04:03] <wikibugs>	 (03PS1) 10Arlolra: Switch to using parsoid-async for direct VirtualRestClient connects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700077 (https://phabricator.wikimedia.org/T244609)
[15:05:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Switch to using parsoid-async for direct VirtualRestClient connects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700077 (https://phabricator.wikimedia.org/T244609) (owner: 10Arlolra)
[15:08:02] <wikibugs>	 (03CR) 10Arlolra: Bump envoy timeout for parsoid-php (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/699425 (https://phabricator.wikimedia.org/T244609) (owner: 10Arlolra)
[15:09:18] <wikibugs>	 (03CR) 10David Caro: "Note to self: this breaks the wmcs branch:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/700076 (owner: 10Volans)
[15:13:03] <wikibugs>	 (03CR) 10Volans: "> Patch Set 1:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/700076 (owner: 10Volans)
[15:14:18] <wikibugs>	 10Puppet, 10Packaging, 10User-jbond: Explore packaging facter 4.0 - https://phabricator.wikimedia.org/T285043 (10jbond) ruby-hocon 1.3 builds fine using the commands from the [[ https://wiki.debian.org/Teams/Ruby/Packaging#Updating_a_package_to_a_newer_version | ruby packaging team ]] i.e  ` $ mr --trust-all...
[15:17:02] <wikibugs>	 (03CR) 10Volans: "This is a proposal to clean a bit the new IcingaHosts API before starting to migrate all the cookbooks to it from the Icinga API in Spicer" [software/spicerack] - 10https://gerrit.wikimedia.org/r/700076 (owner: 10Volans)
[15:22:03] <dancy>	 !log testing upcoming Scap release on beta
[15:22:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:33] <icinga-wm>	 RECOVERY - NFS Share Volume Space /srv/tools on labstore1004 is OK: DISK OK - free space: /srv/tools 1754872 MB (22% inode=79%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1
[15:40:56] <wikibugs>	 10Puppet, 10Packaging, 10User-jbond: Explore packaging facter 4.0 - https://phabricator.wikimedia.org/T285043 (10jbond) Also following the instructions `https://wiki.debian.org/git-pbuilder#Installing_Extra_Packages` to add the updated ruby-hocon` package i was able to build ruby-facter using the ruby-meta r...
[15:49:21] <icinga-wm>	 RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:55:41] <wikibugs>	 (03PS1) 10Majavah: metricsinfra: Monitor toolsbeta [puppet] - 10https://gerrit.wikimedia.org/r/700082
[16:04:34] <wikibugs>	 (03PS1) 10Jelto: copy latest backup to dedicated folder [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/700084 (https://phabricator.wikimedia.org/T274463)
[16:12:28] <wikibugs>	 (03CR) 10Jelto: "the review in https://gerrit.wikimedia.org/r/c/operations/puppet/+/697850 says that we need to store the latest backup of GitLab for bacul" [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/700084 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto)
[16:18:50] <topranks>	 !log Resetting metric on Telia CCT IC-331929, cr1-codfw and cr3-eqsin.
[16:18:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:27] <topranks>	 ^^ this seems stable other than the blip we had earlier, will try to bring back into service and monitor status.
[16:22:09] <topranks>	 done
[16:27:31] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+1 C: 03+2] cassandra: drop support for 2.1 in metrics. Fix collector version [puppet] - 10https://gerrit.wikimedia.org/r/696399 (https://phabricator.wikimedia.org/T275353) (owner: 10Hnowlan)
[16:37:11] <icinga-wm>	 RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:37:39] <icinga-wm>	 RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:09:17] <wikibugs>	 (03PS1) 10Hnowlan: maps: make maps2007 a buster replica of maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/700087 (https://phabricator.wikimedia.org/T269582)
[17:11:10] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29897/console" [puppet] - 10https://gerrit.wikimedia.org/r/700087 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan)
[17:15:27] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+1] postgres: fix sync bugs in resync_replica script [puppet] - 10https://gerrit.wikimedia.org/r/699430 (owner: 10Hnowlan)
[17:41:04] <dancy>	 !log Reverted Scap release on beta
[17:41:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:18] <wikibugs>	 10SRE, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (10sgrabarczuk) > We are tentatively planning a datacenter switchover [...] A date for switching back hasn't been set yet.  We're waiting for the...
[18:35:13] <wikibugs>	 10SRE, 10Technical-blog-posts, 10Wikimedia-Mailing-lists: Story idea for Blog: Discovering and fixing CVE-2021-33038 in Mailman3 - https://phabricator.wikimedia.org/T284486 (10srodlund) Fixed!
[18:41:30] <wikibugs>	 (03PS4) 10Bstorm: Replace os.execv with subprocess.check_call [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697102 (owner: 10Majavah)
[18:43:58] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] "Ok, now I'll merge this and put up a release patch before we try the more "invasive" patch of setting up new routing for grid engine. That" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697102 (owner: 10Majavah)
[18:44:37] <wikibugs>	 (03Merged) 10jenkins-bot: Replace os.execv with subprocess.check_call [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697102 (owner: 10Majavah)
[18:59:37] <wikibugs>	 (03PS1) 10Bstorm: d/changelog: Prepare for 0.75 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/700095
[19:00:49] <wikibugs>	 (03CR) 10Bstorm: "This has fair few changes in it, but we won't know if there are buried problems until release, so...." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/700095 (owner: 10Bstorm)
[19:14:35] <wikibugs>	 (03CR) 10Majavah: "Do we want to try to get https://phabricator.wikimedia.org/T278748#7153319 in this release as well?" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/700095 (owner: 10Bstorm)
[19:18:05] <icinga-wm>	 PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:21:26] <wikibugs>	 (03CR) 10Legoktm: "Yay!" (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/700095 (owner: 10Bstorm)
[19:23:38] <wikibugs>	 (03CR) 10Bstorm: "> Patch Set 1:" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/700095 (owner: 10Bstorm)
[19:24:22] <wikibugs>	 (03CR) 10Bstorm: d/changelog: Prepare for 0.75 release (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/700095 (owner: 10Bstorm)
[19:37:17] <wikibugs>	 10SRE, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (10Legoktm) I added the timeline to https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Schedule_for_June_2021_switch, which is:  * Services: Mo...
[19:39:11] <wikibugs>	 (03CR) 10Legoktm: Bump envoy timeout for parsoid-php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/699425 (https://phabricator.wikimedia.org/T244609) (owner: 10Arlolra)
[19:41:38] <wikibugs>	 (03CR) 10Legoktm: "So just to clarify:" [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup)
[19:56:18] <wikibugs>	 (03CR) 10Bstorm: Use common k8s labels (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/637813 (https://phabricator.wikimedia.org/T266844) (owner: 10Legoktm)
[20:18:49] <icinga-wm>	 RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:08:00] <wikibugs>	 (03PS3) 10H.krishna123: api_db: Add working skeleton code for api_db, add dockerfile [software/bernard] - 10https://gerrit.wikimedia.org/r/699915 (https://phabricator.wikimedia.org/T284399)
[21:11:24] <wikibugs>	 (03CR) 10H.krishna123: "Shall we merge this into master branch? (is that the procedure?)" [software/bernard] - 10https://gerrit.wikimedia.org/r/699915 (https://phabricator.wikimedia.org/T284399) (owner: 10H.krishna123)
[21:32:13] <logmsgbot>	 !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox' for release 'main' .
[21:32:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:08] <logmsgbot>	 !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox' for release 'main' .
[21:35:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:38] <wikibugs>	 10SRE, 10Services, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (10Legoktm) Shellbox is now running in eqiad and codfw: ` legoktm@deploy1002:/srv/deployment-charts/helmfile.d/services/shellbox$ curl https://kubernetes1001.eqiad....
[21:38:58] <wikibugs>	 10SRE, 10Services, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (10Legoktm)
[21:41:27] <wikibugs>	 (03PS12) 10Nikki Nikkhoui: Initial image-suggestion-api helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257)
[21:51:15] <wikibugs>	 (03CR) 10Nikki Nikkhoui: Initial image-suggestion-api helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui)
[22:10:27] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 204 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:14:13] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 28 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:16:03] <wikibugs>	 10SRE, 10DBA, 10Datacenter-Switchover: Check "Days in advance preparation" for databases before DC switchover - https://phabricator.wikimedia.org/T285069 (10Legoktm)
[22:17:55] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 116 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:21:37] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 26 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:21:43] <wikibugs>	 (03PS2) 10Legoktm: Add shellbox to LVS [puppet] - 10https://gerrit.wikimedia.org/r/693959 (https://phabricator.wikimedia.org/T281423)
[22:21:45] <wikibugs>	 (03PS2) 10Legoktm: service: Switch shellbox to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/693960 (https://phabricator.wikimedia.org/T281423)
[22:21:47] <wikibugs>	 (03PS2) 10Legoktm: service: Switch shellbox to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/693961 (https://phabricator.wikimedia.org/T281423)
[22:21:49] <wikibugs>	 (03PS2) 10Legoktm: service: Switch shellbox to production [puppet] - 10https://gerrit.wikimedia.org/r/693962 (https://phabricator.wikimedia.org/T281423)
[22:34:20] <wikibugs>	 (03CR) 10Bstorm: "This script (based on others I've run) seems to do idempotent-ish things after testing on one namespace and testing across all namespaces " [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/637813 (https://phabricator.wikimedia.org/T266844) (owner: 10Legoktm)
[22:41:17] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Please close the wmfkids@ mailing list - https://phabricator.wikimedia.org/T284683 (10Legoktm) 05Open→03Resolved a:03Legoktm List closed. We don't tend to delete archives unless there's a good (usually legal) reason to do so.
[22:55:09] <icinga-wm>	 PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:08:59] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Add link to list archives in default footer - https://phabricator.wikimedia.org/T284256 (10Legoktm) The Wikimedia default footer actually looks like:  ` _______________________________________________ ${display_name} mailing list -- ${listname} List information: https://${doma...
[23:12:59] <icinga-wm>	 RECOVERY - Maps - OSM synchronization lag - codfw on alert1001 is OK: (C)2.592e+05 ge (W)1.764e+05 ge 1.736e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1
[23:55:59] <icinga-wm>	 RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook