[00:01:05] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:13:55] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 40.03 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:17:47] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 79.4 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:56:53] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 144 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:57:19] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 460 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:58:47] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:59:13] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [01:09:27] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [01:11:22] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 13 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [05:55:45] !log start cleaning up auto-review flagged revs logs in plwiki [05:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:52] (03CR) 10Giuseppe Lavagetto: Add configuration for running on kubernetes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703836 (https://phabricator.wikimedia.org/T284418) (owner: 10Giuseppe Lavagetto) [06:28:54] (03CR) 10Legoktm: [C: 03+1] Add configuration for running on kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703836 (https://phabricator.wikimedia.org/T284418) (owner: 10Giuseppe Lavagetto) [06:39:03] PROBLEM - Long running screen/tmux on snapshot1013 is CRITICAL: CRIT: Long running SCREEN process. (user: ariel PID: 42494, 1731414s 1728000s). 
https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [06:39:21] (03PS9) 10KartikMistry: Add stream configuration for ContentTranslation events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982) [06:39:40] !log installing krb5 security updates [06:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:53] 10SRE, 10Traffic, 10User-MoritzMuehlenhoff: Unexpected auditd service restart failure - https://phabricator.wikimedia.org/T287266 (10MoritzMuehlenhoff) [07:03:12] 10SRE: deneb.codfw.wmnet root partition is full - https://phabricator.wikimedia.org/T287222 (10MoritzMuehlenhoff) Cleared my home down to 1.2G. /home in total is down to 30G (and Alex is out), I'm retitling the task to trim the Docker data. [07:04:30] 10SRE, 10serviceops: deneb.codfw.wmnet root partition is full - https://phabricator.wikimedia.org/T287222 (10MoritzMuehlenhoff) [07:08:49] !log Optimize dewiki.logging in eqiad (there will be lag) [07:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:48] <_joe_> !log manage-production-images prune on deneb, T287222 [07:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:54] T287222: deneb.codfw.wmnet root partition is full - https://phabricator.wikimedia.org/T287222 [07:18:29] <_joe_> !log docker-image prune on deneb T287222 [07:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:38] (03CR) 10David Caro: cloud dns: tidy up the labs-ip-alias-dump script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/707478 (https://phabricator.wikimedia.org/T285537) (owner: 10Bstorm) [07:21:45] (03PS3) 10Volans: decorators: improve the retry decorator [software/pywmflib] - 10https://gerrit.wikimedia.org/r/704343 [07:22:14] (03CR) 10Volans: "> Took a bit of time to parse everything, but docstrings helped a lot. 
Some extra simple example in the docstring might help further but n" (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/704343 (owner: 10Volans) [07:22:20] (03Abandoned) 10David Caro: wmcs.labs-ip-alias-dump: add a retry [puppet] - 10https://gerrit.wikimedia.org/r/701515 (https://phabricator.wikimedia.org/T285537) (owner: 10David Caro) [07:30:15] (03CR) 10David Caro: global: add a simple requires.txt (031 comment) [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/707256 (owner: 10David Caro) [07:31:18] (03CR) 10David Caro: "> Patch Set 5: Code-Review+1" [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/706501 (https://phabricator.wikimedia.org/T284213) (owner: 10David Caro) [07:31:21] (03CR) 10Elukey: [C: 03+1] "Perfect thanks :)" (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/704343 (owner: 10Volans) [07:33:27] (03PS3) 10David Caro: global: add a simple requires.txt [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/707256 [07:33:44] dcaro: hi, can we use the CI tox docker job on that operations/debs/prometheus-icinga-exporter repo ? 
[07:34:11] dcaro: there is the CI job to build the debian package but it only runs when touching files under the ./debian/ directory [07:34:20] hashar: it does not have a tox.ini, so I would not expect it to work [07:34:48] not very familiar with the repo though xd [07:35:55] (03PS1) 10Legoktm: Add tox.ini for CI [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/708031 [07:35:58] hmm, that dir only exists on specific branches (debian/sid) [07:36:16] and it has some upstream/0.x branches [07:37:06] godog: might be able to help there ^ [07:37:48] and seems some code changes are made to the debian/sid branch as well [07:38:17] (03CR) 10Volans: [C: 03+2] decorators: improve the retry decorator [software/pywmflib] - 10https://gerrit.wikimedia.org/r/704343 (owner: 10Volans) [07:39:31] we will see :] [07:41:00] (03Merged) 10jenkins-bot: decorators: improve the retry decorator [software/pywmflib] - 10https://gerrit.wikimedia.org/r/704343 (owner: 10Volans) [07:41:28] yeah not sure if we can the use the ci job right away but we certainly should [07:41:32] hashar dcaro ^ [07:43:30] oh [07:43:41] so code is done to debian/sid [07:44:05] and fix up get picked up to the master branch (which is also the upstream/0.x one) [07:46:01] not sure what to do there. 
Some repositories just have tox-docker on the branch holding the source code, and then the debian-glue job for the "debian" integration [07:47:49] (03PS1) 10Volans: decorators: fix typo in docstring [software/pywmflib] - 10https://gerrit.wikimedia.org/r/708032 [07:48:17] (03CR) 10Elukey: [C: 03+1] decorators: fix typo in docstring [software/pywmflib] - 10https://gerrit.wikimedia.org/r/708032 (owner: 10Volans) [07:54:10] (03CR) 10Volans: [C: 03+2] decorators: fix typo in docstring [software/pywmflib] - 10https://gerrit.wikimedia.org/r/708032 (owner: 10Volans) [07:55:42] (03PS1) 10Filippo Giunchedi: pontoon: wait for puppetdb to be up before enabling it [puppet] - 10https://gerrit.wikimedia.org/r/708033 [07:56:55] (03Merged) 10jenkins-bot: decorators: fix typo in docstring [software/pywmflib] - 10https://gerrit.wikimedia.org/r/708032 (owner: 10Volans) [07:58:59] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: add rule to module/profile [puppet] - 10https://gerrit.wikimedia.org/r/706509 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [07:59:04] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: configure thanos rule hosts [puppet] - 10https://gerrit.wikimedia.org/r/706510 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [07:59:07] (03CR) 10Filippo Giunchedi: [C: 03+2] role: activate thanos::rule profile on thanos::frontend [puppet] - 10https://gerrit.wikimedia.org/r/706511 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [07:59:11] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: pull metrics from thanos rule [puppet] - 10https://gerrit.wikimedia.org/r/706512 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [07:59:13] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: query rule component too [puppet] - 10https://gerrit.wikimedia.org/r/706513 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [08:00:22] (03PS4) 10Filippo Giunchedi: prometheus: pull metrics 
from thanos rule [puppet] - 10https://gerrit.wikimedia.org/r/706512 (https://phabricator.wikimedia.org/T287142) [08:00:25] 10SRE, 10Datacenter-Switchover: Add step to rsync home dirs on mwmaint hosts before DC switchover - https://phabricator.wikimedia.org/T287303 (10Volans) I personally disagree, untracked and unreviewed scripts should not run against production IMHO. And we shouldn't encourage and simplify this behaviour but ins... [08:01:18] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] prometheus: pull metrics from thanos rule [puppet] - 10https://gerrit.wikimedia.org/r/706512 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [08:01:32] (03PS4) 10Filippo Giunchedi: thanos: query rule component too [puppet] - 10https://gerrit.wikimedia.org/r/706513 (https://phabricator.wikimedia.org/T287142) [08:01:44] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] thanos: query rule component too [puppet] - 10https://gerrit.wikimedia.org/r/706513 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [08:02:42] (03PS1) 10Muehlenhoff: Fix permissions for /usr/sbin/policy-rc.d [puppet] - 10https://gerrit.wikimedia.org/r/708035 [08:04:22] (03PS14) 10Volans: icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [08:05:50] (03CR) 10Volans: [C: 03+1] "Ship it!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [08:06:24] there might be some thanos alerts firing, that's me [08:06:38] I'm not expecting any, but it is possible [08:07:32] (03CR) 10Muehlenhoff: [C: 03+2] Fix permissions for /usr/sbin/policy-rc.d [puppet] - 10https://gerrit.wikimedia.org/r/708035 (owner: 10Muehlenhoff) [08:10:01] (03PS1) 10Giuseppe Lavagetto: Build the mediawiki-webserver image again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708036 (https://phabricator.wikimedia.org/T285384) [08:10:27] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10User-jijiki: The mediawiki-webserver image should only log in json format - https://phabricator.wikimedia.org/T285384 (10Joe) a:05jijiki→03Joe [08:11:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet [08:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:59] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: ganeti2025, labstore1006, registry2004, thanos-be1003, ganeti2026 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [08:26:55] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host sretest1001.eqiad.wmnet [08:27:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [08:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:04] (03CR) 10Elukey: [C: 04-1] "After a chat with Giuseppe and Janis: this change is dangerous since knative changes may get applied to all clusters." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/707408 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [08:31:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [08:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:02] 10SRE, 10Data-Persistence-Backup: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (10LSobanski) [08:38:02] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Build the mediawiki-webserver image again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708036 (https://phabricator.wikimedia.org/T285384) (owner: 10Giuseppe Lavagetto) [08:38:12] (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 5:" [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/706501 (https://phabricator.wikimedia.org/T284213) (owner: 10David Caro) [08:38:23] <_joe_> jouncebot: next [08:38:23] In 1 hour(s) and 51 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210726T1030) [08:39:02] <_joe_> hashar: given files in the .pipeline directory are not strictly part of mediawiki, I should not deploy them with scap I guess, correct? [08:39:27] (03Merged) 10jenkins-bot: Build the mediawiki-webserver image again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708036 (https://phabricator.wikimedia.org/T285384) (owner: 10Giuseppe Lavagetto) [08:39:28] <_joe_> I just need to fetch the change on the deployment server so that the next deployer doesn't get confused by my patch being there [08:40:09] _joe_: correct :) [08:40:18] I usually just rebase the repo on the deployment server [08:40:24] <_joe_> hashar: ok next q: how do I trigger a rebuild using the pipeline? 
[08:40:49] <_joe_> because I need to rebuild and publish the image I just uncommented [08:40:53] (03PS3) 10Mvolz: Updated outdated helm commands in NOTES.txt [deployment-charts] - 10https://gerrit.wikimedia.org/r/691599 [08:40:56] 10SRE, 10Performance-Team, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10fgiunchedi) >>! In T211661#7227857, @dpifke wrote: > Looks like we're already tracking DELETEs, e.g. the second panel in https://... [08:41:39] _joe_: uncommented? [08:41:49] <_joe_> hashar: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/708036 [08:42:40] (03PS1) 10JMeybohm: docker_registry_ha::web: Sort the hash of kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/708039 [08:42:41] I don't understand [08:43:12] operations/mediawiki-config does not have any pipeline jobs configured in CI [08:43:23] <_joe_> it's in releases jenkins :) [08:43:45] <_joe_> and I think I don't have access [08:44:45] ah yeah so that would be manually run via https://releases-jenkins.wikimedia.org/ [08:44:59] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30331/console" [puppet] - 10https://gerrit.wikimedia.org/r/708039 (owner: 10JMeybohm) [08:45:09] <_joe_> Oh I thought it was automated somehow [08:45:21] <_joe_> in that case, can I ask you kindly to trigger that rebuild? [08:45:23] which might just poll the git repo. 
There is a build running [08:45:32] <_joe_> oh ok good [08:45:47] <_joe_> and btw, someone gave me access in hte meanwile \o/ [08:45:51] and we should definitely grant access to that jenkins to whoever is in charge at SRE :D [08:46:09] (03PS1) 10Muehlenhoff: Explicitly document the semantics of debian::autostart for different OSes [puppet] - 10https://gerrit.wikimedia.org/r/708041 [08:46:28] <_joe_> probably the last time I pestered someone for info I was annoying enough to be gifted access :P [08:47:22] seems it is releng group + releng individuals + m.utante and you [08:47:35] all manually filed in the jenkins config :-\ [08:51:13] (03CR) 10Giuseppe Lavagetto: [C: 03+1] docker_registry_ha::web: Sort the hash of kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/708039 (owner: 10JMeybohm) [08:55:26] (03PS1) 10Filippo Giunchedi: hieradata: point cumin_masters in pontoon to cloudinfra hosts [puppet] - 10https://gerrit.wikimedia.org/r/708042 (https://phabricator.wikimedia.org/T287269) [08:58:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me, but should get additional review by Traffic SREs." [puppet] - 10https://gerrit.wikimedia.org/r/701545 (owner: 10Jbond) [08:58:13] PROBLEM - Long running screen/tmux on snapshot1010 is CRITICAL: CRIT: Long running SCREEN process. (user: ariel PID: 52407, 1739772s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [09:00:04] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] docker_registry_ha::web: Sort the hash of kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/708039 (owner: 10JMeybohm) [09:00:23] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me, but should get additional review by Traffic SREs." 
[puppet] - 10https://gerrit.wikimedia.org/r/701539 (owner: 10Jbond) [09:00:37] (03CR) 10Btullis: Deprecate profile::analytics::cluster::users (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) (owner: 10Ottomata) [09:01:15] (03PS1) 10Filippo Giunchedi: pontoon: disable ircecho/ircbot [puppet] - 10https://gerrit.wikimedia.org/r/708043 (https://phabricator.wikimedia.org/T287265) [09:05:39] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: disable ircecho/ircbot [puppet] - 10https://gerrit.wikimedia.org/r/708043 (https://phabricator.wikimedia.org/T287265) (owner: 10Filippo Giunchedi) [09:09:29] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:11:22] 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Christina Macholan - https://phabricator.wikimedia.org/T287233 (10CMacholan) [09:12:31] 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Christina Macholan - https://phabricator.wikimedia.org/T287233 (10CMacholan) 05Stalled→03Open [09:13:56] (03CR) 10Giuseppe Lavagetto: [C: 04-1] P:tlsproxy::instance: update to use debian::autostart('nginx', false) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/701539 (owner: 10Jbond) [09:14:07] 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Christina Macholan - https://phabricator.wikimedia.org/T287233 (10CMacholan) Thanks for the additional instructions, @RLazarus and @Aklapper. I've updated the "purpose" field in the description above (thanks for adding that template @RLazarus) an... 
[09:14:56] (03CR) 10Volans: "one typo, I think" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708041 (owner: 10Muehlenhoff) [09:15:23] !log rollback sampling for T286038 [09:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:31] T286038: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 [09:16:50] (03PS1) 10JMeybohm: dragonfly::dfdaemon: Allow to specify ratelimit for dfdaemon [puppet] - 10https://gerrit.wikimedia.org/r/708068 (https://phabricator.wikimedia.org/T286054) [09:19:04] 10SRE, 10Infrastructure-Foundations, 10netops, 10Datacenter-Switchover, 10User-fgiunchedi: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (10fgiunchedi) [09:20:14] (03CR) 10Muehlenhoff: Explicitly document the semantics of debian::autostart for different OSes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708041 (owner: 10Muehlenhoff) [09:20:51] (03PS2) 10Muehlenhoff: Explicitly document the semantics of debian::autostart for different OSes [puppet] - 10https://gerrit.wikimedia.org/r/708041 [09:21:13] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30332/console" [puppet] - 10https://gerrit.wikimedia.org/r/708068 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [09:21:44] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10Wikidata-Campsite, and 3 others: 🛑 Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (10Manuel) [09:23:00] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] dragonfly::dfdaemon: Allow to specify ratelimit for dfdaemon [puppet] - 10https://gerrit.wikimedia.org/r/708068 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [09:26:36] (03CR) 10Volans: "Just one thought inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708042 (https://phabricator.wikimedia.org/T287269) 
(owner: 10Filippo Giunchedi) [09:26:56] (03CR) 10Muehlenhoff: [C: 03+2] Make the RAPI certname configurable [puppet] - 10https://gerrit.wikimedia.org/r/706424 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [09:27:49] (03CR) 10Jgiannelos: "Currently we are testing the staging deployment of tegola. It would be helpful to get some app level metrics to understand how well things" [deployment-charts] - 10https://gerrit.wikimedia.org/r/705859 (owner: 10Jgiannelos) [09:34:01] (03CR) 10JMeybohm: [C: 03+1] "Assuming you had nothing listening on tcp/9102 before, this LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/705859 (owner: 10Jgiannelos) [09:37:28] (03CR) 10Jgiannelos: "> Patch Set 3: Code-Review+1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/705859 (owner: 10Jgiannelos) [09:37:42] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Enable prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/705859 (owner: 10Jgiannelos) [09:40:38] (03Merged) 10jenkins-bot: tegola-vector-tiles: Enable prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/705859 (owner: 10Jgiannelos) [09:40:50] (03CR) 10Filippo Giunchedi: hieradata: point cumin_masters in pontoon to cloudinfra hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708042 (https://phabricator.wikimedia.org/T287269) (owner: 10Filippo Giunchedi) [09:42:37] (03PS1) 10Giuseppe Lavagetto: service::catalog: lower the depool threshold for api [puppet] - 10https://gerrit.wikimedia.org/r/708072 [09:43:32] (03PS1) 10Filippo Giunchedi: icinga: fix ircbot::ensure logic [puppet] - 10https://gerrit.wikimedia.org/r/708073 (https://phabricator.wikimedia.org/T287265) [09:44:41] (03CR) 10Vgutierrez: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/708072 (owner: 10Giuseppe Lavagetto) [09:44:55] (03CR) 10jerkins-bot: [V: 04-1] icinga: fix ircbot::ensure logic [puppet] - 10https://gerrit.wikimedia.org/r/708073 (https://phabricator.wikimedia.org/T287265) (owner: 10Filippo Giunchedi) [09:45:11] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30333/console" [puppet] - 10https://gerrit.wikimedia.org/r/708073 (https://phabricator.wikimedia.org/T287265) (owner: 10Filippo Giunchedi) [09:50:51] (03PS1) 10Giuseppe Lavagetto: mwdebug: refresh the application images [deployment-charts] - 10https://gerrit.wikimedia.org/r/708076 (https://phabricator.wikimedia.org/T285384) [09:54:45] (03PS2) 10Jcrespo: dbbackups: Reimage dbprov1002 to buster [puppet] - 10https://gerrit.wikimedia.org/r/707243 (https://phabricator.wikimedia.org/T287230) [09:55:11] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [09:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:56] (03PS2) 10Filippo Giunchedi: icinga: fix ircbot::ensure logic [puppet] - 10https://gerrit.wikimedia.org/r/708073 (https://phabricator.wikimedia.org/T287265) [09:59:21] seeking kind souls to sanity check ^ [10:01:00] can you do the same for alertmanager too? 
[10:01:42] (03PS1) 10David Caro: ceph: fix a typo [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/708078 [10:01:44] (03PS1) 10David Caro: ceph: Added tests to CephOSDController [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/708079 [10:01:51] (03PS4) 10Muehlenhoff: Add ganeti2025/2026 to Ganeti test cluster [puppet] - 10https://gerrit.wikimedia.org/r/706315 (https://phabricator.wikimedia.org/T286206) [10:02:04] godog: looking [10:02:27] thanks moritzm [10:03:15] majavah: not necessary, the irc bot for am already DTRT and uses a different account [10:03:42] I'm doing the minimum for ircecho/the whole legacy stack though [10:04:45] (03CR) 10Filippo Giunchedi: [C: 03+1] Explicitly document the semantics of debian::autostart for different OSes [puppet] - 10https://gerrit.wikimedia.org/r/708041 (owner: 10Muehlenhoff) [10:05:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/708073 (https://phabricator.wikimedia.org/T287265) (owner: 10Filippo Giunchedi) [10:05:20] DTRT? 
[10:05:35] does the right thing :) [10:06:05] ah, great [10:06:20] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: fix ircbot::ensure logic [puppet] - 10https://gerrit.wikimedia.org/r/708073 (https://phabricator.wikimedia.org/T287265) (owner: 10Filippo Giunchedi) [10:14:42] (03PS1) 10Filippo Giunchedi: icinga: remove grafana alerts for Traffic, moved to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/708081 (https://phabricator.wikimedia.org/T282806) [10:17:19] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:21:29] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708041 (owner: 10Muehlenhoff) [10:22:32] downtiming ml-serve nodes [10:24:54] (03CR) 10Jbond: [C: 04-1] "ill mark this as WIP for now, i agree with joe preventing start on a reboot is do undesirable" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/701539 (owner: 10Jbond) [10:25:50] (03PS3) 10David Caro: ceph: add latency monitoring stats [puppet] - 10https://gerrit.wikimedia.org/r/700182 (https://phabricator.wikimedia.org/T281254) [10:26:56] (03PS3) 10Volans: decorators: migrate to the wmflib version [software/spicerack] - 10https://gerrit.wikimedia.org/r/704345 (https://phabricator.wikimedia.org/T257905) [10:27:07] (03CR) 10Volans: "addressed comments" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/704345 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [10:27:56] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: refresh the application images [deployment-charts] - 10https://gerrit.wikimedia.org/r/708076 (https://phabricator.wikimedia.org/T285384) (owner: 10Giuseppe Lavagetto) [10:30:05] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210726T1030). [10:31:09] (03Merged) 10jenkins-bot: mwdebug: refresh the application images [deployment-charts] - 10https://gerrit.wikimedia.org/r/708076 (https://phabricator.wikimedia.org/T285384) (owner: 10Giuseppe Lavagetto) [10:32:12] (03CR) 10jerkins-bot: [V: 04-1] decorators: migrate to the wmflib version [software/spicerack] - 10https://gerrit.wikimedia.org/r/704345 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [10:32:25] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:33:20] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:46] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at codfw #page on alert1001 is CRITICAL: 0.08468 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [10:33:47] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at codfw #page on alert1001 is CRITICAL: 0.005693 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver [10:34:07] * volans here, acked [10:34:13] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [10:34:15] <_joe_> uh [10:34:25] <_joe_> anyone else around? [10:34:29] is it a real issue or just an artifact? [10:34:34] I am [10:34:35] <_joe_> of what/ [10:34:39] PROBLEM - Apache HTTP on mw2316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:34:41] PROBLEM - PHP7 rendering on mw2316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:34:45] PROBLEM - PHP7 rendering on mw2363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:34:47] PROBLEM - Apache HTTP on mw2336 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:34:48] uh [10:34:49] PROBLEM - PHP7 rendering on mw2371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:34:49] taking a long time to load grafana here [10:34:51] PROBLEM - PHP7 rendering on mw2351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:34:53] PROBLEM - Apache HTTP on mw2305 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:34:53] PROBLEM - PHP7 rendering on mw2409 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:34:53] PROBLEM - PHP7 rendering on mw2355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:34:53] PROBLEM 
- PHP7 rendering on mw2392 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:34:55] PROBLEM - Apache HTTP on mw2357 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:34:55] PROBLEM - Apache HTTP on mw2258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:34:55] PROBLEM - Apache HTTP on mw2351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:34:55] PROBLEM - PHP7 rendering on mw2388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:34:57] PROBLEM - PHP7 rendering on mw2277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:34:57] PROBLEM - PHP7 rendering on mw2365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:34:59] PROBLEM - Apache HTTP on mw2303 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:34:59] PROBLEM - PHP7 rendering on mw2357 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:34:59] PROBLEM - PHP7 rendering on mw2369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:02] <_joe_> I guess it's a problem with the db or something [10:35:03] PROBLEM - Apache HTTP on mw2276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:03] PROBLEM - Apache HTTP on mw2310 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:03] PROBLEM - PHP7 rendering on mw2274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:05] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 2248 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:35:05] PROBLEM - PHP7 rendering on mw2332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:05] PROBLEM - Apache HTTP on mw2268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:05] PROBLEM - phpfpm_up reduced availability on alert1001 is CRITICAL: 0.6449 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:35:07] PROBLEM - Apache HTTP on mw2334 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:07] PROBLEM - PHP7 rendering on mw2291 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:07] PROBLEM - Apache HTTP on mw2339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:09] PROBLEM - PHP7 rendering on mw2373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:11] PROBLEM - PHP7 rendering on mw2257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:11] PROBLEM - PHP7 rendering on mw2272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:12] PROBLEM - PHP7 rendering on mw2271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:12] PROBLEM - PHP7 rendering on mw2353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:12] PROBLEM - Apache HTTP on mw2272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:12] PROBLEM - Apache HTTP on mw2373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:12] I am checking the dbs [10:35:13] PROBLEM - PHP7 rendering on mw2275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:13] PROBLEM - Apache HTTP on mw2274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:15] PROBLEM - PHP7 rendering on mw2287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:15] PROBLEM - PHP7 rendering on mw2359 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:15] PROBLEM - PHP7 rendering on mw2360 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:17] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at codfw on alert1001 is CRITICAL: 0.9692 
gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [10:35:17] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [10:35:19] PROBLEM - PHP7 rendering on mw2375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:21] PROBLEM - Apache HTTP on mw2408 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:21] PROBLEM - PHP7 rendering on mw2299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:22] PROBLEM - Apache HTTP on mw2326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:22] PROBLEM - Apache HTTP on mw2401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:22] PROBLEM - Apache HTTP on mw2296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:22] PROBLEM - Apache HTTP on mw2350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:25] PROBLEM - Apache HTTP on mw2385 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:25] PROBLEM - Apache HTTP on mw2313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [10:35:25] PROBLEM - Apache HTTP on mw2287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:25] PROBLEM - PHP7 rendering on mw2323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:25] PROBLEM - Apache HTTP on mw2286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:26] PROBLEM - Apache HTTP on mw2271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:26] PROBLEM - PHP7 rendering on mw2317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:27] PROBLEM - PHP7 rendering on mw2356 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:27] PROBLEM - Apache HTTP on mw2387 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:28] PROBLEM - PHP7 rendering on mw2292 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:28] PROBLEM - PHP7 rendering on mw2327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:29] PROBLEM - Apache HTTP on mw2399 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:29] PROBLEM - Apache HTTP on mw2355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:30] PROBLEM - Apache HTTP on mw2333 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [10:35:30] PROBLEM - PHP7 rendering on mw2337 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:31] PROBLEM - Apache HTTP on mw2365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:31] PROBLEM - Apache HTTP on mw2391 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:32] PROBLEM - PHP7 rendering on mw2352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:32] PROBLEM - Apache HTTP on mw2389 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:33] PROBLEM - Apache HTTP on mw2311 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:33] PROBLEM - Apache HTTP on mw2361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:33] <_joe_> yes [10:35:34] PROBLEM - PHP7 rendering on mw2303 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:34] PROBLEM - Apache HTTP on mw2261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:35] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POS [10:35:35] PROBLEM - PHP7 rendering on mw2368 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:35] <_joe_> [0x00007fde2281e720] query() /srv/mediawiki/php-1.37.0-wmf.15/includes/libs/rdbms/database/DatabaseMysqli.php:46 [10:35:36] PROBLEM - PHP7 rendering on mw2305 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:36] PROBLEM - Apache HTTP on mw2386 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:37] PROBLEM - Apache HTTP on mw2301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:37] PROBLEM - PHP7 rendering on mw2384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:38] PROBLEM - PHP7 rendering on mw2397 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:38] PROBLEM - PHP7 rendering on mw2306 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:39] PROBLEM - Apache HTTP on mw2330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:39] PROBLEM - PHP7 rendering on mw2309 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:39] oh god [10:35:40] PROBLEM - Apache HTTP on mw2302 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:40] PROBLEM - PHP7 rendering on mw2315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:41] PROBLEM - PHP7 rendering on mw2333 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:41] PROBLEM - PHP7 rendering on mw2402 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:42] PROBLEM - Apache HTTP on mw2294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:42] PROBLEM - Apache HTTP on mw2306 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:43] PROBLEM - PHP7 rendering on mw2386 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:43] PROBLEM - PHP7 rendering on mw2339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:43] let's move to -sre [10:35:44] PROBLEM - Apache HTTP on mw2400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:44] PROBLEM - PHP7 rendering on mw2354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:45] PROBLEM - PHP7 rendering on mw2318 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:45] PROBLEM - PHP7 rendering on mw2401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:46] PROBLEM - Apache HTTP on mw2253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:46] PROBLEM - Apache HTTP on mw2277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:47] PROBLEM - Apache HTTP on mw2384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:47] PROBLEM - Apache HTTP on mw2304 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:48] PROBLEM - PHP7 rendering on mw2330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:48] PROBLEM - PHP7 rendering on mw2396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:49] PROBLEM - PHP7 rendering on mw2286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:49] PROBLEM - PHP7 rendering on mw2334 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:50] PROBLEM - Apache HTTP on mw2407 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:50] PROBLEM - PHP7 rendering on mw2320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:51] PROBLEM - Apache HTTP on mw2319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:51] PROBLEM - PHP7 rendering on mw2361 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:52] PROBLEM - Apache HTTP on mw2371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:52] PROBLEM - Apache HTTP on mw2363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:53] PROBLEM - Apache HTTP on mw2257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:53] PROBLEM - PHP7 rendering on mw2405 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:54] PROBLEM - PHP7 rendering on mw2258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:54] PROBLEM - Apache HTTP on mw2335 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:55] PROBLEM - Apache HTTP on mw2254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:55] PROBLEM - PHP7 rendering on mw2270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:56] PROBLEM - Apache HTTP on mw2318 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:56] PROBLEM - PHP7 rendering on mw2407 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:57] PROBLEM - Apache HTTP on mw2292 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:35:57] PROBLEM - PHP7 rendering on mw2393 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:35:58] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not fou [10:35:58] nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [10:35:59] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not fou [10:35:59] nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [10:36:00] PROBLEM - Apache HTTP on mw2404 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:00] PROBLEM - Apache HTTP on mw2396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:01] PROBLEM - PHP7 rendering on mw2391 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:01] PROBLEM - PHP7 rendering on mw2311 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:02] PROBLEM - Apache HTTP on mw2338 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:02] PROBLEM - Apache HTTP on mw2327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:03] PROBLEM - Apache HTTP on mw2375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:03] PROBLEM - Apache HTTP on mw2402 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:04] PROBLEM - PHP7 rendering on mw2328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:04] PROBLEM - Apache HTTP on mw2290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:05] PROBLEM - PHP7 rendering on mw2325 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:05] PROBLEM - PHP7 rendering on mw2370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:06] PROBLEM - PHP7 rendering on mw2307 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:06] PROBLEM - Apache HTTP on mw2390 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:07] PROBLEM - PHP7 rendering on mw2350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:07] PROBLEM - Apache HTTP on mw2269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:08] PROBLEM - PHP7 rendering on mw2408 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:08] PROBLEM - PHP7 rendering on mw2406 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:09] PROBLEM - PHP7 rendering on mw2302 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:36:09] PROBLEM - Apache HTTP on mw2284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:36:22] the bot got kicked [10:36:24] lol [10:36:37] <_joe_> yes we know we're down [10:36:54] we're back [10:38:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2149', diff saved to https://phabricator.wikimedia.org/P16892 and previous config saved to /var/cache/conftool/dbconfig/20210726-103847-marostegui.json [10:38:53] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:38:54] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:38:54] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:38:54] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:38:54] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:38:54] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:38:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:55] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [10:39:12] RECOVERY - phpfpm_up reduced availability on alert1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:39:16] PROBLEM - Varnish has reduced HTTP availability #page on alert1001 is CRITICAL: job=varnish-text https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [10:39:21] <_joe_> marostegui: seems like your intervention worked [10:39:37] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [10:39:42] marostegui: <3 [10:39:44] <_joe_> still seeing logs for slow queries though [10:39:52] <_joe_> but I think it's leftovers [10:40:00] <_joe_> elukey: can you check how pybal is seeing the appservers? [10:40:07] RECOVERY - High average POST latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=POST [10:40:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:40:30] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [10:40:36] <_joe_> I'm not sure we're out of the woods though [10:40:55] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is CRITICAL: 0.8594 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [10:41:04] <_joe_> yup as I said [10:41:16] RECOVERY - Varnish has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [10:41:25] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [10:41:29] <_joe_> I still see a lot of errors connecting to the databases [10:41:48] upstream connect error or disconnect/reset before headers. reset reason: overflow [10:41:52] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at codfw #page on alert1001 is CRITICAL: 0.1211 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [10:41:53] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at codfw #page on alert1001 is CRITICAL: 0.01534 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver [10:42:39] PROBLEM - Apache HTTP on mw2261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:42:59] PROBLEM - PHP7 rendering on mw2270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:42:59] PROBLEM - PHP7 rendering on mw2407 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:00] PROBLEM - PHP7 rendering on mw2393 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:01] PROBLEM - Apache HTTP on mw2396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:02] PROBLEM - Apache HTTP on mw2404 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:03] PROBLEM - Apache HTTP on mw2338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:03] PROBLEM - Apache HTTP on mw2375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:03] PROBLEM - Apache HTTP on mw2327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:03] PROBLEM - PHP7 rendering on mw2311 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:03] PROBLEM - PHP7 rendering on mw2325 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:04] PROBLEM - Apache HTTP on mw2390 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:04] PROBLEM - PHP7 rendering on mw2307 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:04] PROBLEM - Apache HTTP on mw2290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:05] PROBLEM - PHP7 rendering on mw2328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:05] PROBLEM - PHP7 rendering on mw2391 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:06] PROBLEM - PHP7 rendering on mw2408 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:06] PROBLEM - Apache HTTP on mw2269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:07] PROBLEM - PHP7 rendering on mw2406 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:07] PROBLEM - Apache HTTP on mw2270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:08] PROBLEM - Apache HTTP on mw2367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:08] PROBLEM - PHP7 rendering on mw2336 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:09] PROBLEM - PHP7 rendering on mw2273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:10] PROBLEM - PHP7 rendering on mw2301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:12] PROBLEM - Apache HTTP on mw2315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:13] PROBLEM - phpfpm_up reduced availability on alert1001 is CRITICAL: 0.6477 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:43:15] PROBLEM - Apache HTTP on mw2409 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:15] PROBLEM - PHP7 rendering on mw2362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:23] PROBLEM - Apache HTTP on mw2275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:23] PROBLEM - PHP7 rendering on mw2289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:25] PROBLEM - PHP7 rendering on mw2331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:25] PROBLEM - PHP7 rendering on mw2310 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:25] PROBLEM - Apache HTTP on mw2393 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:25] PROBLEM - PHP7 rendering on mw2314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:26] PROBLEM - PHP7 rendering on mw2253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:27] PROBLEM - PHP7 rendering on mw2399 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:27] PROBLEM - Apache HTTP on mw2331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:27] PROBLEM - PHP7 rendering 
on mw2297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:27] PROBLEM - Apache HTTP on mw2392 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:28] PROBLEM - Apache HTTP on mw2299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:28] PROBLEM - PHP7 rendering on mw2319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:28] PROBLEM - Apache HTTP on mw2405 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:29] PROBLEM - Apache HTTP on mw2397 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:29] PROBLEM - PHP7 rendering on mw2268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:30] PROBLEM - Apache HTTP on mw2359 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:30] PROBLEM - PHP7 rendering on mw2335 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:31] PROBLEM - PHP7 rendering on mw2254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:31] PROBLEM - Apache HTTP on mw2298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:32] PROBLEM - Apache HTTP on mw2322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:32] PROBLEM - PHP7 rendering 
on mw2255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:35] PROBLEM - PHP7 rendering on mw2367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:35] PROBLEM - PHP7 rendering on mw2269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:37] PROBLEM - Apache HTTP on mw2369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:37] PROBLEM - PHP7 rendering on mw2372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:37] PROBLEM - PHP7 rendering on mw2312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:37] PROBLEM - PHP7 rendering on mw2329 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:38] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POS [10:43:39] PROBLEM - PHP7 rendering on mw2358 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:39] PROBLEM - PHP7 rendering on mw2283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:40] PROBLEM - PHP7 rendering on mw2308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:40] PROBLEM - PHP7 rendering on mw2298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:40] PROBLEM - Apache HTTP on mw2307 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:40] PROBLEM - Apache HTTP on mw2358 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:40] PROBLEM - Apache HTTP on mw2288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:41] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/description/{title} (Get description for test page) is CRITICAL: Test Get description for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response [10:43:41] ived: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/ [10:43:42] ile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 
200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a r [10:43:42] was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [10:43:43] PROBLEM - PHP7 rendering on mw2389 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:43] PROBLEM - Apache HTTP on mw2273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:45] PROBLEM - Apache HTTP on mw2300 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:45] PROBLEM - Apache HTTP on mw2316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:46] PROBLEM - Apache HTTP on mw2406 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:46] PROBLEM - PHP7 rendering on mw2385 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:47] PROBLEM - Apache HTTP on mw2309 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:47] PROBLEM - PHP7 rendering on mw2316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:47] PROBLEM - Apache HTTP on mw2353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:48] PROBLEM - PHP7 rendering on mw2313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:48] PROBLEM - Apache HTTP on mw2293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:49] PROBLEM - Apache HTTP on mw2255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:49] PROBLEM - PHP7 rendering on mw2326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:51] PROBLEM - Apache HTTP on mw2332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:52] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/description/{title} (Get description for test page) is CRITICAL: Test Get description for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/media-list/{title} (Get media list from test page) is CRITICAL: Test Get media [10:43:52] m test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returne [10:43:52] expected status 503 (expecting: 200): /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the 
unexpected status 503 (expecting: 200): /{domain}/v1/transform/html/to/mobile-html/{title} ( [10:43:52] iew mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [10:43:53] PROBLEM - PHP7 rendering on mw2398 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:53] PROBLEM - Apache HTTP on mw2324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:53] PROBLEM - PHP7 rendering on mw2363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:54] PROBLEM - Apache HTTP on mw2336 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:55] PROBLEM - PHP7 rendering on mw2296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:55] PROBLEM - PHP7 rendering on mw2324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:55] PROBLEM - PHP7 rendering on mw2371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:57] PROBLEM - Apache HTTP on mw2291 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:58] PROBLEM - PHP7 rendering on mw2351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:58] PROBLEM - PHP7 rendering on mw2390 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:58] PROBLEM - PHP7 rendering on mw2300 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:58] PROBLEM - Apache HTTP on mw2308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:58] PROBLEM - Apache HTTP on mw2297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:59] PROBLEM - Apache HTTP on mw2305 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:43:59] PROBLEM - PHP7 rendering on mw2409 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:43:59] PROBLEM - PHP7 rendering on mw2355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:44:00] PROBLEM - PHP7 rendering on mw2262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:44:01] PROBLEM - PHP7 rendering on mw2392 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:44:02] PROBLEM - Apache HTTP on mw2357 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:44:02] PROBLEM - Apache HTTP on mw2351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:44:02] PROBLEM - Apache HTTP on mw2258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:44:03] PROBLEM - PHP7 rendering on mw2294 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:44:03] PROBLEM - Apache HTTP on mw2398 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:44:04] PROBLEM - PHP7 rendering on mw2388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:44:04] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not fou [10:44:05] nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [10:44:05] PROBLEM - PHP7 rendering on mw2277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:44:06] PROBLEM - Apache HTTP on mw2314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:44:06] PROBLEM - PHP7 rendering on mw2365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:44:07] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not fou 
[10:44:07] nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [10:44:08] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:44:08] PROBLEM - Apache HTTP on mw2303 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:44:09] PROBLEM - Apache HTTP on mw2362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:44:09] PROBLEM - PHP7 rendering on mw2293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:44:10] PROBLEM - Apache HTTP on mw2312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:44:10] PROBLEM - PHP7 rendering on mw2403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:44:11] PROBLEM - PHP7 rendering on mw2357 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:44:11] PROBLEM - PHP7 rendering on mw2369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:44:12] PROBLEM - PHP7 rendering on mw2338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:44:12] PROBLEM - Apache HTTP on mw2370 is CRITICAL: CRITICAL 
- Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:44:13] PROBLEM - PHP7 rendering on mw2364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:44:13] PROBLEM - PHP7 rendering on mw2400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:44:14] PROBLEM - Apache HTTP on mw2310 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:44:14] PROBLEM - PHP7 rendering on mw2274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:44:15] PROBLEM - Apache HTTP on mw2276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:44:15] PROBLEM - PHP7 rendering on mw2290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:44:16] PROBLEM - Apache HTTP on mw2283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:44:16] PROBLEM - High average POST latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=POST [10:44:17] PROBLEM - Apache HTTP on mw2289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:44:41] 10SRE, 10Wikimedia-Incident: upstream connect error or disconnect/reset before headers. 
reset reason: overflow - https://phabricator.wikimedia.org/T287362 (10Peachey88) [10:45:42] 10SRE, 10Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T287362 (10RhinosF1) p:05Triage→03Unbreak! [10:45:43] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [10:45:45] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - api_80: Servers mw2284.codfw.wmnet, mw2350.codfw.wmnet, mw2396.codfw.wmnet, mw2286.codfw.wmnet, mw2302.codfw.wmnet, mw2261.codfw.wmnet, mw2360.codfw.wmnet, mw2326.codfw.wmnet, mw2298.codfw.wmnet, mw2288.codfw.wmnet, mw2364.codfw.wmnet, mw2308.codfw.wmnet, mw2321.codfw.wmnet, mw2294.codfw.wmnet, mw2253.codfw.wmnet, mw2356.codfw.wmnet, mw2324.codfw.wmn [10:45:45] 96.codfw.wmnet, mw2283.codfw.wmnet, mw2358.codfw.wmnet, mw2328.codfw.wmnet, mw2398.codfw.wmnet, mw2320.codfw.wmnet, mw2370.codfw.wmnet, mw2368.codfw.wmnet, mw2330.codfw.wmnet, mw2289.codfw.wmnet, mw2306.codfw.wmnet, mw2352.codfw.wmnet, mw2400.codfw.wmnet, mw2405.codfw.wmnet, mw2297.codfw.wmnet, mw2354.codfw.wmnet, mw2295.codfw.wmnet, mw2399.codfw.wmnet, mw2293.codfw.wmnet, mw2401.codfw.wmnet, mw2317.codfw.wmnet, mw2291.codfw.wmnet, mw2322 [10:45:45] mnet, mw2362.codfw.wmnet, mw2402.codfw.wmnet, mw2332.codfw.wmnet, mw2287.codfw.wmnet, mw2403.codfw.wmnet, mw2366.codfw.wmnet, mw2319.codfw.wmnet, mw2285.codfw.wmnet, mw2374.codfw.wmnet, https://wikitech.wikimedia.org/wiki/PyBal [10:45:47] PROBLEM - Apache HTTP on mw2325 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:45:47] PROBLEM - Apache HTTP on mw2285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [10:45:48] 10SRE, 10Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T287362 (10Joe) This is a known ongoing issue with a database overload, please be patient while we work on a resolution [10:45:49] PROBLEM - Apache HTTP on mw2328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:45:49] PROBLEM - PHP7 rendering on mw2304 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:45:51] PROBLEM - Apache HTTP on mw2403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:45:52] PROBLEM - PHP7 rendering on mw2288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:45:59] PROBLEM - Apache HTTP on mw2295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:45:59] PROBLEM - PHP7 rendering on mw2295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:46:02] PROBLEM - LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.codfw.wmnet IPv4 #page on api.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:46:03] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:46:05] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed 
out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:46:07] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:46:07] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:46:13] 10SRE, 10Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T287362 (10alaa) [10:46:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2149', diff saved to https://phabricator.wikimedia.org/P16893 and previous config saved to /var/cache/conftool/dbconfig/20210726-104613-marostegui.json [10:46:17] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [10:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:19] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [10:46:23] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:46:45] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response 
was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:46:45] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:46:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2149', diff saved to https://phabricator.wikimedia.org/P16894 and previous config saved to /var/cache/conftool/dbconfig/20210726-104649-marostegui.json [10:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:05] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [10:47:06] 10SRE, 10Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T287362 (10Joe) Specifically, it seems to be a recurrence of T262240 [10:47:17] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [10:47:20] PROBLEM - Varnish has reduced HTTP availability #page on alert1001 is CRITICAL: job=varnish-text https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [10:47:25] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. 
reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Jdforrester-WMF) [10:47:32] RECOVERY - Apache HTTP on mw2299 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.631 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:47:45] RECOVERY - PHP7 rendering on mw2304 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 3.880 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:47:57] RECOVERY - PHP7 rendering on mw2398 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.190 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:47:59] RECOVERY - Apache HTTP on mw2295 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.575 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:47:59] RECOVERY - PHP7 rendering on mw2295 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.591 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:03] RECOVERY - PHP7 rendering on mw2293 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 5.419 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:05] RECOVERY - Apache HTTP on mw2398 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.078 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:07] RECOVERY - PHP7 rendering on mw2403 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.748 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:07] RECOVERY - PHP7 rendering on mw2400 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.993 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:09] RECOVERY - PHP7 rendering on mw2364 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.866 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:13] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [10:48:15] RECOVERY - Apache HTTP on mw2320 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 6.601 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:15] RECOVERY - PHP7 rendering on mw2332 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.658 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:15] RECOVERY - Apache HTTP on mw2334 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.437 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:17] RECOVERY - PHP7 rendering on mw2284 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 5.600 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:17] RECOVERY - PHP7 rendering on mw2291 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.430 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:17] RECOVERY - Apache HTTP on mw2317 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.250 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:19] RECOVERY - PHP7 rendering on mw2287 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 2.366 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:19] RECOVERY - Apache HTTP on mw2272 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.101 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:19] RECOVERY - PHP7 rendering on mw2272 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:19] RECOVERY - restbase 
endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:48:19] RECOVERY - PHP7 rendering on mw2271 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.409 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:21] RECOVERY - Apache HTTP on mw2323 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.792 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:25] RECOVERY - PHP7 rendering on mw2360 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.166 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:27] RECOVERY - PHP7 rendering on mw2299 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 3.221 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:27] RECOVERY - Apache HTTP on mw2287 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 1.349 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:31] RECOVERY - wikidata.org dispatch lag is REALLY high ---4000s- on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 2104 bytes in 5.152 second response time https://phabricator.wikimedia.org/project/view/71/ [10:48:32] RECOVERY - PHP7 rendering on mw2323 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 5.954 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:32] RECOVERY - Apache HTTP on mw2350 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:32] RECOVERY - Apache HTTP on mw2296 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.417 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:32] RECOVERY - PHP7 rendering on mw2356 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.378 second response 
time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:33] RECOVERY - PHP7 rendering on mw2352 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 4.381 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:33] RECOVERY - PHP7 rendering on mw2317 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.546 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:34] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_text layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [10:48:35] RECOVERY - Apache HTTP on mw2271 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.609 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:35] RECOVERY - Apache HTTP on mw2286 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.900 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:35] RECOVERY - PHP7 rendering on mw2292 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.222 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:37] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:48:37] RECOVERY - Apache HTTP on mw2261 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.976 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:37] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:48:39] RECOVERY - Apache HTTP on mw2399 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.362 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [10:48:39] RECOVERY - PHP7 rendering on mw2306 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 1.387 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:39] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:48:39] RECOVERY - PHP7 rendering on mw2397 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 1.796 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:41] RECOVERY - PHP7 rendering on mw2368 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 3.869 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:42] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:48:42] RECOVERY - Apache HTTP on mw2302 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:42] RECOVERY - Apache HTTP on mw2330 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 1.992 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:45] RECOVERY - Apache HTTP on mw2306 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:45] RECOVERY - Apache HTTP on mw2253 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.164 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:45] RECOVERY - PHP7 rendering on mw2318 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:45] RECOVERY - PHP7 rendering on mw2402 is OK: HTTP OK: HTTP/1.1 302 Found - 649 
bytes in 0.487 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:45] RECOVERY - PHP7 rendering on mw2354 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.532 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:45] RECOVERY - Apache HTTP on mw2294 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 1.692 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:46] RECOVERY - Apache HTTP on mw2301 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.763 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:46] RECOVERY - PHP7 rendering on mw2384 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.136 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:47] RECOVERY - PHP7 rendering on mw2334 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:47] RECOVERY - PHP7 rendering on mw2396 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:48] RECOVERY - Apache HTTP on mw2319 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:48] RECOVERY - PHP7 rendering on mw2320 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:49] RECOVERY - PHP7 rendering on mw2330 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:49] RECOVERY - Apache HTTP on mw2304 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.101 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [10:48:50] RECOVERY - PHP7 rendering on mw2401 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 1.441 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:50] RECOVERY - Apache HTTP on mw2400 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 2.745 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:52] RECOVERY - PHP7 rendering on mw2386 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.325 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:48:55] RECOVERY - Apache HTTP on mw2318 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:59] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:48:59] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:48:59] RECOVERY - Apache HTTP on mw2404 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:59] RECOVERY - Apache HTTP on mw2396 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.123 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:48:59] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:49:00] RECOVERY - Apache HTTP on mw2292 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 2.307 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:00] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase 
[10:49:01] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:49:01] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:49:02] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:49:02] RECOVERY - Apache HTTP on mw2290 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.405 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:03] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:49:03] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:49:04] RECOVERY - Apache HTTP on mw2402 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:04] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [10:49:05] RECOVERY - PHP7 rendering on mw2350 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.526 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:05] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [10:49:06] RECOVERY - PHP7 rendering on mw2302 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:06] RECOVERY - Apache HTTP on mw2284 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.105 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [10:49:07] RECOVERY - PHP7 rendering on mw2370 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 2.551 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:07] RECOVERY - Apache HTTP on mw2352 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:08] RECOVERY - PHP7 rendering on mw2321 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:08] RECOVERY - Apache HTTP on mw2354 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:09] RECOVERY - Apache HTTP on mw2327 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.463 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:09] RECOVERY - Apache HTTP on mw2338 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.530 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:11] RECOVERY - Apache HTTP on mw2356 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.272 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:12] RECOVERY - Apache HTTP on mw2372 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.436 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:12] RECOVERY - Apache HTTP on mw2360 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.494 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:12] RECOVERY - PHP7 rendering on mw2362 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:12] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All 
endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [10:49:13] RECOVERY - PHP7 rendering on mw2406 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.352 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:13] RECOVERY - PHP7 rendering on mw2408 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.745 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:13] RECOVERY - Apache HTTP on mw2367 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.994 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:15] RECOVERY - PHP7 rendering on mw2261 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.245 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:15] RECOVERY - PHP7 rendering on mw2301 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.016 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:19] RECOVERY - PHP7 rendering on mw2289 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:21] RECOVERY - PHP7 rendering on mw2253 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:22] RECOVERY - Apache HTTP on mw2409 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.870 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:22] RECOVERY - PHP7 rendering on mw2322 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 1.524 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:23] RECOVERY - PHP7 rendering on mw2297 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.104 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:23] RECOVERY - PHP7 rendering on mw2319 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:24] RECOVERY - Apache HTTP on mw2397 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:25] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [10:49:25] RECOVERY - Apache HTTP on mw2322 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.876 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:25] RECOVERY - Apache HTTP on mw2405 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 1.730 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:26] RECOVERY - PHP7 rendering on mw2404 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:26] RECOVERY - Apache HTTP on mw2321 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:27] RECOVERY - Apache HTTP on mw2298 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 2.405 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:27] RECOVERY - PHP7 rendering on mw2399 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 4.064 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:35] RECOVERY - PHP7 rendering on mw2358 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:35] RECOVERY - Apache HTTP on mw2358 is OK: HTTP OK: HTTP/1.1 302 
Found - 634 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:36] RECOVERY - Apache HTTP on mw2288 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:41] RECOVERY - PHP7 rendering on mw2372 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.653 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:42] RECOVERY - PHP7 rendering on mw2283 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 5.313 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:42] RECOVERY - Apache HTTP on mw2328 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 1.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:42] RECOVERY - Apache HTTP on mw2300 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:43] RECOVERY - PHP7 rendering on mw2288 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.890 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:45] RECOVERY - Apache HTTP on mw2293 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 2.215 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:45] RECOVERY - Apache HTTP on mw2285 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.305 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:49] RECOVERY - Apache HTTP on mw2332 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 2.610 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:51] RECOVERY - Apache HTTP on mw2403 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.017 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:52] RECOVERY - Apache HTTP on 
mw2324 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 1.885 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:52] RECOVERY - PHP7 rendering on mw2296 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:52] RECOVERY - PHP7 rendering on mw2324 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:54] RECOVERY - LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.codfw.wmnet IPv4 #page on api.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 24425 bytes in 0.785 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:49:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db2149', diff saved to https://phabricator.wikimedia.org/P16895 and previous config saved to /var/cache/conftool/dbconfig/20210726-104953-marostegui.json [10:49:55] RECOVERY - Apache HTTP on mw2291 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:55] RECOVERY - Apache HTTP on mw2308 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:57] RECOVERY - Apache HTTP on mw2297 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 2.430 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:57] PROBLEM - PHP7 rendering on mw2286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:58] RECOVERY - PHP7 rendering on mw2300 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 3.246 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:49:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:59] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:49:59] RECOVERY - PHP7 rendering on mw2262 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 4.008 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:50:03] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:50:05] RECOVERY - Apache HTTP on mw2362 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.156 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:50:09] RECOVERY - Apache HTTP on mw2283 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.159 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:50:11] RECOVERY - Apache HTTP on mw2289 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.882 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:50:12] RECOVERY - PHP7 rendering on mw2290 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.110 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:50:13] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:50:17] RECOVERY - Apache HTTP on mw2364 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.247 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:50:17] RECOVERY - PHP7 rendering on mw2285 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.452 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:50:19] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/RESTBase [10:50:19] RECOVERY - Apache HTTP on mw2366 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.727 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:50:19] RECOVERY - PHP7 rendering on mw2366 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.819 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:50:20] RECOVERY - Apache HTTP on mw2262 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.543 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:50:25] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [10:50:27] RECOVERY - Apache HTTP on mw2401 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 2.584 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:50:31] RECOVERY - Apache HTTP on mw2326 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.123 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:50:35] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [10:50:37] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [10:50:37] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [10:50:39] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:50:39] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:50:39] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:50:42] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:50:42] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:50:43] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:50:43] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:50:52] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [10:50:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:50:56] SRE, Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers.
reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (Peachey88) [10:50:59] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:51:01] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:51:02] RECOVERY - Apache HTTP on mw2257 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.479 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:51:02] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:51:02] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:51:02] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:51:02] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [10:51:35] RECOVERY - PHP7 rendering on mw2298 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:51:41] RECOVERY - PHP7 rendering on mw2308 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.156 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:51:50] !log deploying 10 second mw user query limit on s3 codfw replicas [10:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:01] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:52:07] !log ladsgroup@deploy1002 Scap
failed!: 2/6 canaries failed their endpoint checks(https://en.wikipedia.org) [10:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:52:34] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [10:52:52] Trying again [10:53:09] PROBLEM - PHP7 rendering on mw2400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:53:18] RECOVERY - Varnish has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [10:53:19] PROBLEM - PHP7 rendering on mw2284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:53:20] !log ladsgroup@deploy1002 Scap failed!: 3/6 canaries failed their endpoint checks(https://en.wikipedia.org) [10:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:32] SRE, Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (IN) Oh, my gosh.
[10:53:37] PROBLEM - PHP7 rendering on mw2317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:53:47] PROBLEM - Apache HTTP on mw2301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:53:47] PROBLEM - PHP7 rendering on mw2384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:53:55] PROBLEM - PHP7 rendering on mw2386 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:54:09] PROBLEM - Apache HTTP on mw2327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:54:09] PROBLEM - Apache HTTP on mw2338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:54:13] PROBLEM - PHP7 rendering on mw2408 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:54:13] PROBLEM - PHP7 rendering on mw2406 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:54:15] PROBLEM - PHP7 rendering on mw2301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:54:21] PROBLEM - Apache HTTP on mw2409 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:54:25] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:54:32] PROBLEM - Apache 
HTTP on mw2405 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:54:32] PROBLEM - PHP7 rendering on mw2399 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:54:42] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:54:42] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:54:47] PROBLEM - PHP7 rendering on mw2304 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:54:49] PROBLEM - Apache HTTP on mw2293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:54:55] PROBLEM - Apache HTTP on mw2295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:54:55] PROBLEM - PHP7 rendering on mw2295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:54:57] PROBLEM - PHP7 rendering on mw2296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:54:58] upstream connect error or disconnect/reset before headers. 
reset reason: overflow [10:55:01] PROBLEM - Apache HTTP on mw2308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:55:02] PROBLEM - PHP7 rendering on mw2300 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:02] RECOVERY - PHP7 rendering on mw2391 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 4.743 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:02] PROBLEM - PHP7 rendering on mw2262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:03] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:55:03] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:55:04] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Disable DPL on ruwikinews (duration: 00m 27s) [10:55:07] PROBLEM - PHP7 rendering on mw2293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:07] PROBLEM - Apache HTTP on mw2362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:55:07] PROBLEM - PHP7 rendering on mw2403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:08] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error 
or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Peachey88) @IN Comments like that and +1 style comments don't help, Please try to leave only constructive comments that help a... [10:55:09] PROBLEM - PHP7 rendering on mw2364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:09] PROBLEM - Apache HTTP on mw2283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:55:09] PROBLEM - PHP7 rendering on mw2290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:15] PROBLEM - PHP7 rendering on mw2332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:15] PROBLEM - Apache HTTP on mw2334 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:55:17] PROBLEM - PHP7 rendering on mw2271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:17] PROBLEM - Apache HTTP on mw2323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:55:18] deployed [10:55:19] PROBLEM - PHP7 rendering on mw2285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:19] PROBLEM - Apache HTTP on mw2366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:55:19] PROBLEM - PHP7 rendering on mw2366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:19] PROBLEM - 
Apache HTTP on mw2262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:55:24] sDrewthedoff: we know [10:55:25] PROBLEM - PHP7 rendering on mw2360 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:27] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [10:55:29] okay [10:55:31] PROBLEM - Apache HTTP on mw2401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:55:32] PROBLEM - PHP7 rendering on mw2299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:32] PROBLEM - Apache HTTP on mw2296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:55:32] PROBLEM - Apache HTTP on mw2350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:55:32] PROBLEM - Apache HTTP on mw2326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:55:33] PROBLEM - Apache HTTP on mw2286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:35] PROBLEM - PHP7 rendering on mw2356 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:35] PROBLEM - PHP7 rendering on mw2323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:37] PROBLEM - Apache 
HTTP on mw2399 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:55:37] PROBLEM - PHP7 rendering on mw2292 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:37] PROBLEM - PHP7 rendering on mw2352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:39] PROBLEM - Apache HTTP on mw2261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:55:41] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Jonathan5566) Now can read but very slow. [10:55:45] PROBLEM - PHP7 rendering on mw2397 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:45] PROBLEM - PHP7 rendering on mw2306 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:47] PROBLEM - Apache HTTP on mw2330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:55:51] PROBLEM - PHP7 rendering on mw2354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:52] PROBLEM - Apache HTTP on mw2294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:55:52] PROBLEM - Apache HTTP on mw2306 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:55:52] PROBLEM - PHP7 rendering on mw2318 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:53] PROBLEM - Apache HTTP on mw2400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:55:53] PROBLEM - Apache HTTP on mw2319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:55:53] PROBLEM - PHP7 rendering on mw2330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:55] RECOVERY - PHP7 rendering on mw2286 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.561 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:55:57] PROBLEM - Apache HTTP on mw2257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:56:02] PROBLEM - Apache HTTP on mw2318 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:56:09] RECOVERY - Apache HTTP on mw2370 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.895 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:11] PROBLEM - Apache HTTP on mw2367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:56:12] RECOVERY - Apache HTTP on mw2368 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.677 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:13] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:56:15] RECOVERY - PHP7 rendering on mw2373 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.135 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=php site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:56:17] RECOVERY - PHP7 rendering on mw2353 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.873 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:17] RECOVERY - Apache HTTP on mw2373 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.849 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:19] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:56:19] RECOVERY - Apache HTTP on mw2405 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:23] RECOVERY - PHP7 rendering on mw2399 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 3.672 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:31] RECOVERY - Apache HTTP on mw2408 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:32] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Firestar464) Reading is faster than editing, though both are still very slow at the moment. 
[10:56:33] RECOVERY - Apache HTTP on mw2385 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.740 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:33] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:56:35] RECOVERY - Apache HTTP on mw2311 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 6.725 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:35] RECOVERY - PHP7 rendering on mw2337 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.780 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:35] RECOVERY - PHP7 rendering on mw2327 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.078 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:36] RECOVERY - PHP7 rendering on mw2304 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:36] RECOVERY - Apache HTTP on mw2387 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.080 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:36] RECOVERY - Apache HTTP on mw2365 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.250 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:37] RECOVERY - Apache HTTP on mw2391 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.438 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:37] RECOVERY - Apache HTTP on mw2355 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.568 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:37] RECOVERY - Apache HTTP on mw2333 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.675 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [10:56:38] RECOVERY - PHP7 rendering on mw2303 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.239 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:38] RECOVERY - Apache HTTP on mw2389 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.829 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:39] RECOVERY - Apache HTTP on mw2293 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:39] RECOVERY - Apache HTTP on mw2361 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:40] RECOVERY - Apache HTTP on mw2386 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 4.734 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:40] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [10:56:41] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [10:56:41] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [10:56:42] RECOVERY - PHP7 rendering on mw2305 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.902 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:43] RECOVERY - PHP7 rendering on mw2309 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.133 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:43] RECOVERY - Apache HTTP on mw2384 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.989 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:45] RECOVERY - PHP7 rendering on mw2315 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 4.541 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:45] RECOVERY - PHP7 rendering on mw2295 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:45] RECOVERY - Apache HTTP on mw2295 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:47] RECOVERY - PHP7 rendering on mw2333 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.472 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:47] RECOVERY - Apache HTTP on mw2407 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 2.163 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:47] RECOVERY - PHP7 rendering on mw2296 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.368 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering 
[10:56:47] RECOVERY - Apache HTTP on mw2277 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.867 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:47] RECOVERY - PHP7 rendering on mw2339 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 3.895 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:48] RECOVERY - PHP7 rendering on mw2361 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 3.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:49] RECOVERY - Apache HTTP on mw2371 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 1.462 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:49] RECOVERY - Apache HTTP on mw2308 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:49] RECOVERY - PHP7 rendering on mw2300 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:50] RECOVERY - Apache HTTP on mw2254 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.163 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:51] RECOVERY - Apache HTTP on mw2335 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.899 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:52] RECOVERY - Apache HTTP on mw2363 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.241 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:52] RECOVERY - PHP7 rendering on mw2262 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.421 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:52] RECOVERY - PHP7 rendering on mw2258 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 2.741 second 
response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:53] RECOVERY - PHP7 rendering on mw2407 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:53] RECOVERY - PHP7 rendering on mw2393 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:53] RECOVERY - PHP7 rendering on mw2270 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:55] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:56:55] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:56:56] RECOVERY - PHP7 rendering on mw2293 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:56] RECOVERY - PHP7 rendering on mw2403 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.089 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:56] RECOVERY - Apache HTTP on mw2362 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:56] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:56:57] RECOVERY - Apache HTTP on mw2375 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:56:57] RECOVERY - PHP7 rendering on 
mw2311 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:58] RECOVERY - PHP7 rendering on mw2325 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:58] RECOVERY - PHP7 rendering on mw2307 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:59] RECOVERY - PHP7 rendering on mw2400 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:56:59] RECOVERY - Apache HTTP on mw2390 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:00] RECOVERY - PHP7 rendering on mw2364 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:00] RECOVERY - Apache HTTP on mw2283 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:01] RECOVERY - PHP7 rendering on mw2290 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:01] RECOVERY - Apache HTTP on mw2269 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.681 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:02] RECOVERY - Apache HTTP on mw2270 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:02] RECOVERY - PHP7 rendering on mw2336 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 
0.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:03] RECOVERY - PHP7 rendering on mw2273 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:05] RECOVERY - PHP7 rendering on mw2332 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:05] RECOVERY - Apache HTTP on mw2315 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:05] RECOVERY - Apache HTTP on mw2334 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:07] RECOVERY - Apache HTTP on mw2323 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:07] RECOVERY - PHP7 rendering on mw2271 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:08] RECOVERY - Apache HTTP on mw2329 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:08] RECOVERY - Apache HTTP on mw2366 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:08] RECOVERY - PHP7 rendering on mw2284 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:08] RECOVERY - PHP7 rendering on mw2285 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.111 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:08] RECOVERY - PHP7 rendering on mw2366 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:09] RECOVERY - Apache HTTP on mw2262 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.267 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:09] RECOVERY - PHP7 rendering on mw2387 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:15] RECOVERY - PHP7 rendering on mw2360 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:16] PROBLEM - Varnish has reduced HTTP availability #page on alert1001 is CRITICAL: job=varnish-text https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [10:57:16] RECOVERY - PHP7 rendering on mw2276 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:17] RECOVERY - Apache HTTP on mw2275 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:18] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10IEPCBM) Wikisource previously returned code 502: Bad Gateway. 
[10:57:19] RECOVERY - Apache HTTP on mw2388 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:20] RECOVERY - LVS apaches codfw port 80/tcp - Main MediaWiki application server cluster- appservers.svc.codfw.wmnet IPv4 #page on appservers.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:57:20] RECOVERY - PHP7 rendering on mw2331 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:20] RECOVERY - PHP7 rendering on mw2314 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:20] RECOVERY - Apache HTTP on mw2393 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:20] RECOVERY - PHP7 rendering on mw2310 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.110 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:21] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [10:57:23] RECOVERY - PHP7 rendering on mw2299 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:23] RECOVERY - Apache HTTP on mw2296 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:23] RECOVERY - Apache HTTP on mw2392 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:23] RECOVERY - 
Apache HTTP on mw2350 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:23] RECOVERY - PHP7 rendering on mw2335 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:24] RECOVERY - Apache HTTP on mw2331 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:24] RECOVERY - Apache HTTP on mw2401 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.127 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:25] RECOVERY - PHP7 rendering on mw2268 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:25] RECOVERY - Apache HTTP on mw2359 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:26] RECOVERY - PHP7 rendering on mw2254 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:26] RECOVERY - Apache HTTP on mw2326 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.384 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:27] RECOVERY - Apache HTTP on mw2286 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:27] RECOVERY - PHP7 rendering on mw2356 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:28] RECOVERY - PHP7 rendering on mw2323 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.101 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:28] RECOVERY - PHP7 rendering on mw2317 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.110 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:29] RECOVERY - PHP7 rendering on mw2255 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:29] RECOVERY - Apache HTTP on mw2399 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:30] RECOVERY - PHP7 rendering on mw2292 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:30] RECOVERY - PHP7 rendering on mw2352 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:31] RECOVERY - PHP7 rendering on mw2367 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:31] RECOVERY - PHP7 rendering on mw2269 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:32] RECOVERY - Apache HTTP on mw2261 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:33] RECOVERY - Apache HTTP on mw2369 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:33] RECOVERY - PHP7 rendering on mw2312 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.118 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:33] RECOVERY - PHP7 rendering on mw2329 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.137 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:35] RECOVERY - Apache HTTP on mw2307 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:35] RECOVERY - PHP7 rendering on mw2397 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:36] RECOVERY - PHP7 rendering on mw2306 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:37] RECOVERY - Apache HTTP on mw2301 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:37] RECOVERY - PHP7 rendering on mw2384 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:38] RECOVERY - Apache HTTP on mw2330 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:38] RECOVERY - Apache HTTP on mw2325 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:39] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [10:57:39] RECOVERY - PHP7 rendering on mw2389 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:40] 
RECOVERY - Apache HTTP on mw2273 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:40] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [10:57:41] RECOVERY - Apache HTTP on mw2316 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:42] RECOVERY - Apache HTTP on mw2406 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:42] RECOVERY - Apache HTTP on mw2309 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:42] RECOVERY - PHP7 rendering on mw2385 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:42] RECOVERY - PHP7 rendering on mw2316 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:42] RECOVERY - PHP7 rendering on mw2313 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:42] RECOVERY - Apache HTTP on mw2353 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:43] RECOVERY - PHP7 rendering on mw2354 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:43] RECOVERY - Apache HTTP on mw2306 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.098 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [10:57:44] RECOVERY - Apache HTTP on mw2294 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:44] RECOVERY - PHP7 rendering on mw2318 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:45] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [10:57:45] RECOVERY - PHP7 rendering on mw2386 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:46] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:57:46] RECOVERY - Apache HTTP on mw2400 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:47] RECOVERY - Apache HTTP on mw2319 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:47] RECOVERY - PHP7 rendering on mw2330 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.126 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:48] RECOVERY - Apache HTTP on mw2255 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.123 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:48] RECOVERY - PHP7 rendering on mw2326 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.215 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:49] RECOVERY - PHP7 rendering on mw2363 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.099 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:49] RECOVERY - Apache HTTP on mw2336 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:50] RECOVERY - Apache HTTP on mw2257 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:50] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [10:57:52] RECOVERY - PHP7 rendering on mw2371 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:53] RECOVERY - Apache HTTP on mw2305 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:53] RECOVERY - PHP7 rendering on mw2351 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:53] RECOVERY - Apache HTTP on mw2318 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:53] RECOVERY - PHP7 rendering on mw2409 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:53] RECOVERY - PHP7 rendering on mw2390 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:54] RECOVERY - PHP7 rendering on mw2355 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:55] 
RECOVERY - PHP7 rendering on mw2392 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:57] RECOVERY - Apache HTTP on mw2357 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:57] RECOVERY - PHP7 rendering on mw2388 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:57] RECOVERY - Apache HTTP on mw2351 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:57] RECOVERY - PHP7 rendering on mw2294 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:57] RECOVERY - Apache HTTP on mw2258 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:58] RECOVERY - Apache HTTP on mw2303 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:58] RECOVERY - PHP7 rendering on mw2277 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:57:59] RECOVERY - Apache HTTP on mw2314 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:57:59] RECOVERY - Apache HTTP on mw2312 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:58:00] RECOVERY - Apache HTTP on mw2327 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.100 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [10:58:00] RECOVERY - PHP7 rendering on mw2365 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:58:01] RECOVERY - Apache HTTP on mw2338 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:58:01] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:58:02] RECOVERY - PHP7 rendering on mw2357 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:58:02] RECOVERY - PHP7 rendering on mw2338 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:58:03] RECOVERY - PHP7 rendering on mw2369 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:58:03] RECOVERY - Apache HTTP on mw2310 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:58:04] RECOVERY - PHP7 rendering on mw2274 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:58:04] RECOVERY - Apache HTTP on mw2276 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:58:05] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:58:05] RECOVERY - Apache HTTP on mw2367 is OK: HTTP OK: 
HTTP/1.1 302 Found - 634 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:58:06] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [10:58:06] RECOVERY - PHP7 rendering on mw2406 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:58:07] RECOVERY - PHP7 rendering on mw2408 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:58:07] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [10:58:08] RECOVERY - PHP7 rendering on mw2301 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:58:08] RECOVERY - Apache HTTP on mw2268 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:58:09] RECOVERY - Apache HTTP on mw2339 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:58:09] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:58:11] RECOVERY - PHP7 rendering on mw2257 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:58:11] RECOVERY - Apache HTTP on mw2337 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:58:12] RECOVERY - Apache HTTP on mw2409 is OK: HTTP OK: HTTP/1.1 302 Found - 634 
bytes in 0.123 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:58:12] RECOVERY - PHP7 rendering on mw2275 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:58:15] RECOVERY - Apache HTTP on mw2274 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:58:15] RECOVERY - PHP7 rendering on mw2359 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:58:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:58:17] RECOVERY - PHP7 rendering on mw2375 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:58:25] RECOVERY - Apache HTTP on mw2313 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:58:35] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [10:58:35] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [10:58:35] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:58:37] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:58:53] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or 
disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Peachey88) [10:59:13] RECOVERY - phpfpm_up reduced availability on alert1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:59:16] RECOVERY - Varnish has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [10:59:24] thx James_F [10:59:25] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at codfw on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [10:59:25] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [10:59:34] Of course. [10:59:39] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [10:59:43] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. 
reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Ryse93) For me it works now [10:59:51] seems like it's caused by the ruwikinews huh [10:59:54] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at codfw #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.8347 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [10:59:55] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at codfw #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.8076 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver [10:59:57] Yet again. [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European mid-day backport window. Your patch may or may not be deployed at the sole discretion of the deployer deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210726T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. [11:00:07] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Ladsgroup) Status update: I disabled DPL on ruwikinews and now things are getting back to normal [11:00:11] RECOVERY - High average POST latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=POST [11:00:16] This was the last three site-downs too. [11:00:21] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [11:00:41] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Firestar464) As a result, Twinkle has finally loaded. [11:00:47] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:00:51] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:57] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is OK: All metrics within thresholds. 
https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [11:01:11] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 25 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:02:18] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Peachey88) [11:03:10] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10IN) Now it's a little better, and some pages are accessible. [11:04:43] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10RhinosF1) Please avoid telling us how things look every 30 seconds [11:05:01] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Krassotkin) You couldn't solve the problem for a whole year and now you've killed Russian Wikinews again. But these are some... [11:08:08] 10SRE, 10DynamicPageList (Wikimedia), 10PoolCounter, 10serviceops, and 4 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10LSobanski) The problem happened again - see T287362. Could this task be reviewed in terms of priority? [11:08:46] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. 
reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Krinkle) I got this error just now when using search on en.wikipedia.org, via the browser's search bar at 10SRE, 10DynamicPageList (Wikimedia), 10PoolCounter, 10serviceops, and 5 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10Peachey88) [11:11:03] (03PS1) 10Hashar: gerrit: add boiler plate spec for gerrit::jetty [puppet] - 10https://gerrit.wikimedia.org/r/708088 (https://phabricator.wikimedia.org/T287360) [11:11:05] (03PS1) 10Hashar: gerrit: quote values in config when having # or ; [puppet] - 10https://gerrit.wikimedia.org/r/708089 (https://phabricator.wikimedia.org/T287360) [11:11:11] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:13:07] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10ReaperDawn) Looks like it is starting to recover [11:15:46] (03CR) 10jerkins-bot: [V: 04-1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/708091 (owner: 10L10n-bot) [11:18:26] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10DonSimon) >>! In T287362#7235783, @Ladsgroup wrote: > Status update: I disabled DPL on ruwikinews and now things are getting b... [11:37:23] (03PS1) 10R4q3NWnUx2CEhVyr: Second submission for realloc misusage. 
[software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/708094 [11:38:05] (03Abandoned) 10R4q3NWnUx2CEhVyr: Allocate only the needed size for the format structure array [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/318053 (owner: 10R4q3NWnUx2CEhVyr) [11:40:41] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10MBH) FYI: ruwikinews "owners" (Krassotkin & Co.) recently uploaded to ruwikinews millions of "news", created 10-15 years ago o... [11:45:04] (03PS1) 10Cathal Mooney: First attempt to create puppet class for statograph service, which exports statistics about WMF infra to statuspage.io for external visibility. Info: https://gerrit.wikimedia.org/r/admin/repos/operations/software/statograph [puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) [11:45:34] (03CR) 10jerkins-bot: [V: 04-1] First attempt to create puppet class for statograph service, which exports statistics about WMF infra to statuspage.io for external visibility. Info: https://gerrit.wikimedia.org/r/admin/repos/operations/software/statograph [puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) (owner: 10Cathal Mooney) [11:45:40] (03CR) 10Nikerabbit: "l10nbot-watchers do not have the required rights here." [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/706390 (owner: 10L10n-bot) [11:49:02] 10SRE, 10Wikimedia-Mailing-lists, 10translatewiki.net, 10Language-Team (Language-2021-July-September): Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (10Nikerabbit) l10n-bot-watchers rights are still missing, so we can't even override Jenkins currently. Docs: https:/... [11:49:54] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. 
reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10IEPCBM) >>! In T287362#7235910, @MBH wrote: > FYI: ruwikinews "owners" (Krassotkin & Co.) recently uploaded to ruwikinews mill... [11:51:33] (03PS2) 10Cathal Mooney: First attempt to create puppet class for statograph service. [puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) [11:52:09] (03CR) 10jerkins-bot: [V: 04-1] First attempt to create puppet class for statograph service. [puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) (owner: 10Cathal Mooney) [11:55:17] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Joe) The issue, which to clarify was a global outage of all wikis, happened because of that mass upload triggered a lot of new... [11:57:20] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Marostegui) For what is worth, these are the wikis living on s3 databases: https://noc.wikimedia.org/conf/dblists/s3.dblist [12:03:02] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10alaa) >>! In T287362#7235932, @Marostegui wrote: > For what is worth, these are the wikis living on s3 databases: https://noc.... [12:05:26] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Ladsgroup) >>! In T287362#7235937, @alaa wrote: >>>! In T287362#7235932, @Marostegui wrote: >> For what is worth, these are th... 
[12:05:52] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Majavah) >>! In T287362#7235937, @alaa wrote: >>>! In T287362#7235932, @Marostegui wrote: >> For what is worth, these are the... [12:06:52] (03PS5) 10Btullis: Update TLS configuration for analytics-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) [12:09:49] 10SRE, 10Wikimedia-Incident: Lots of s3 wikis broken: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10patilise) With a second incident, it is clear that the issue will be recurring if we simply re-enable DPL on ruwikinews now. I... [12:09:59] 10SRE, 10Wikimedia-Incident: General site outage caused by ruwikinews abuse of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Jdforrester-WMF) [12:10:57] 10SRE, 10Wikimedia-Incident: General site outage caused by ruwikinews abuse of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Jdforrester-WMF) I've re-titled the task to be more accurate; the outage did indeed expand beyon... [12:16:38] 10SRE, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. 
reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Peachey88) [12:17:40] (03Abandoned) 10Hashar: gerrit: quote values in config when having # or ; [puppet] - 10https://gerrit.wikimedia.org/r/708089 (https://phabricator.wikimedia.org/T287360) (owner: 10Hashar) [12:18:23] (03PS1) 10Jbond: policy-rc.d: update policy-rc.d script to handle missing services [puppet] - 10https://gerrit.wikimedia.org/r/708100 [12:22:07] (03PS1) 10Ladsgroup: Disable DPL on ruwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708101 (https://phabricator.wikimedia.org/T287362) [12:23:58] (03PS6) 10Btullis: Update TLS configuration for analytics-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) [12:25:34] 10SRE, 10Patch-For-Review, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Ladsgroup) >>! In T287362#7235854, @DonSimon wrote: >>>! In T287362#723578... [12:26:28] (03CR) 10Ladsgroup: [C: 03+2] "Already deployed." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/708101 (https://phabricator.wikimedia.org/T287362) (owner: 10Ladsgroup) [12:27:13] (03Merged) 10jenkins-bot: Disable DPL on ruwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708101 (https://phabricator.wikimedia.org/T287362) (owner: 10Ladsgroup) [12:30:28] (03PS1) 10Hashar: Revert "gerrit: daemon option in gerrit.config" [puppet] - 10https://gerrit.wikimedia.org/r/708102 (https://phabricator.wikimedia.org/T287122) [12:30:30] (03PS1) 10Hashar: gerrit: remove unused settings from [container] [puppet] - 10https://gerrit.wikimedia.org/r/708103 (https://phabricator.wikimedia.org/T287122) [12:30:32] (03PS1) 10Hashar: gerrit: remove unused container.javaOptions values [puppet] - 10https://gerrit.wikimedia.org/r/708104 (https://phabricator.wikimedia.org/T287122) [12:31:17] 10SRE, 10DynamicPageList (Wikimedia), 10PoolCounter, 10serviceops, and 5 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10mark) p:05Medium→03High Given that the underlying problem that this change might help with has already caused multiple full outages (all wikis... [12:32:37] 10SRE, 10Patch-For-Review, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Aklapper) @Krassotkin: Please see and follow https://www.mediawiki.org/wik... [12:35:22] (03PS1) 10Filippo Giunchedi: haproxy: bullseye support [puppet] - 10https://gerrit.wikimedia.org/r/708105 [12:35:24] 10SRE, 10Patch-For-Review, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Krassotkin) Absolutely everyone knows about this problem and the ways to s... 
[12:35:25] (03PS1) 10Filippo Giunchedi: haproxy: remove sleep 10 [puppet] - 10https://gerrit.wikimedia.org/r/708106 [12:38:11] 10SRE, 10Patch-For-Review, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Aklapper) @Krassotkin: You are very welcome to contribute by providing pat... [12:43:45] 10SRE, 10Patch-For-Review, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Krassotkin) >>! In T287362#7236049, @Aklapper wrote: > @Krassotkin: You ar... [12:43:59] (03CR) 10Jcrespo: [C: 03+1] "Moritz will know if buster was tested in the end and the issue was confirmed as fixed. We may have some standby haproxies to test it other" [puppet] - 10https://gerrit.wikimedia.org/r/708106 (owner: 10Filippo Giunchedi) [12:44:54] (03CR) 10Jcrespo: [C: 03+1] "Thank by the way to remind us to revert this!" [puppet] - 10https://gerrit.wikimedia.org/r/708106 (owner: 10Filippo Giunchedi) [12:46:01] jynus: sure np! I ran some tests with haproxy last week and was wondering why it was taking so long to restart [12:46:13] thank you for the quick review [12:46:24] he he it was anoying indeed [12:46:57] please involve the people I suggest, I predict no issue, but better to ping them [12:47:20] (03PS2) 10Filippo Giunchedi: haproxy: bullseye support [puppet] - 10https://gerrit.wikimedia.org/r/708105 [12:47:22] (03PS2) 10Filippo Giunchedi: haproxy: remove sleep 10 [puppet] - 10https://gerrit.wikimedia.org/r/708106 [12:47:23] makes sense, I'll add them [12:47:34] thanks to you for doing the work, I mentioned the need to revert on an old comment but then one forgets! [12:48:04] it was not an elegant patch but it was better than having no firewall! 
[12:48:42] (03PS7) 10Btullis: Update TLS configuration for analytics-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) [12:49:38] indeed, I have another patch to remove generate_haproxy_default.sh which afaik now we can do [12:50:16] uf, https://memegenerator.net/img/instances/45671676/now-thats-a-name-i-havent-heard-in-a-long-time.jpg [12:50:24] 10SRE, 10Patch-For-Review, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Aklapper) @Krassotkin: In that case, please either refrain from adding unh... [12:50:47] haha [12:51:05] (03PS1) 10Filippo Giunchedi: haproxy: read config directory natively [puppet] - 10https://gerrit.wikimedia.org/r/708108 [12:52:04] (03CR) 10Filippo Giunchedi: "I went with restarting haproxy if commandline changes, though I'm not attached to it, please LMK what you think" [puppet] - 10https://gerrit.wikimedia.org/r/708108 (owner: 10Filippo Giunchedi) [12:53:03] (03PS1) 10PipelineBot: rdf-streaming-updater: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/708109 [12:53:27] (03CR) 10Filippo Giunchedi: "Also to note: this and related I40665b989 Ib4727df53f are not urgent on my end but certainly nice to have deployed" [puppet] - 10https://gerrit.wikimedia.org/r/708108 (owner: 10Filippo Giunchedi) [12:54:11] (03CR) 10Jcrespo: [C: 03+1] "Still +1, but maintainers affected should be the ones to decide on details." 
[puppet] - 10https://gerrit.wikimedia.org/r/708108 (owner: 10Filippo Giunchedi) [12:58:33] (03PS1) 10DCausse: rdf-streaming-updater: use image version 2021-07-26-125114-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/708111 (https://phabricator.wikimedia.org/T264006) [12:59:34] (03CR) 10Jcrespo: [C: 03+1] haproxy: read config directory natively (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708108 (owner: 10Filippo Giunchedi) [13:01:14] (03CR) 10Hashar: "https://puppet-compiler.wmflabs.org/compiler1001/30338/" [puppet] - 10https://gerrit.wikimedia.org/r/708102 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [13:05:58] 10SRE, 10Patch-For-Review, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Krassotkin) @Aklapper Please note I don't speak English, but there is noth... [13:13:57] (03PS8) 10Btullis: Update TLS configuration for analytics-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) [13:15:44] (03CR) 10Marostegui: [C: 03+1] haproxy: remove sleep 10 [puppet] - 10https://gerrit.wikimedia.org/r/708106 (owner: 10Filippo Giunchedi) [13:16:11] (03CR) 10Marostegui: [C: 03+1] haproxy: bullseye support [puppet] - 10https://gerrit.wikimedia.org/r/708105 (owner: 10Filippo Giunchedi) [13:23:07] (03CR) 10JMeybohm: [C: 03+1] miscweb: add a define for the httpd prometheus exporter and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/700522 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [13:28:09] (03CR) 10Ottomata: Add stream configuration for ContentTranslation events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982) (owner: 10KartikMistry) [13:30:20] (03CR) 10Ottomata: Deprecate profile::analytics::cluster::users (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) (owner: 10Ottomata) [13:38:35] 10SRE, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10IN) >>! In T287362#7235937, @alaa wrote: >>>! In T287362#7235932, @Marostegui wrote: >> For what... [13:39:21] (03PS10) 10KartikMistry: Add stream configuration for ContentTranslation events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982) [13:39:50] (03CR) 10KartikMistry: Add stream configuration for ContentTranslation events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982) (owner: 10KartikMistry) [13:40:45] (03PS9) 10Btullis: Update TLS configuration for analytics-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) [13:42:35] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:42:38] (03Abandoned) 10DCausse: rdf-streaming-updater: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/708109 (owner: 10PipelineBot) [13:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:35] (03PS1) 10Btullis: Add an alluxio keytab to an-test-presto1001.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/708115 [13:51:05] 10SRE, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Sunny00217) >>! In T287362#7236237, @IN wrote: >>>! In T287362#7235937, @alaa wrote: >>>>! In T2... 
[13:51:41] 10SRE, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10IN) Now phabricator also runs more slowly, it also takes a very long time to query a tag. [13:52:09] (03PS2) 10Btullis: Add an alluxio keytab to an-test-presto1001.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/708115 [13:54:00] 10SRE, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10IN) >>! In T287362#7236262, @Sunny00217 wrote: >>>! In T287362#7236237, @IN wrote: >>>>! In T287... [13:56:07] (03PS3) 10Btullis: Add an alluxio keytab to an-test-presto1001.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/708115 [13:58:14] 10SRE, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10IN) The server took a long time to preview my message. @anyilin Maybe you can see why it running... [13:58:41] 10SRE, 10DynamicPageList (Wikimedia), 10PoolCounter, 10serviceops, and 8 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10sbassett) [13:59:29] 10SRE, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Aklapper) @IN: Please stop adding totally unrelated comments. This task is not about Phabricator... [14:01:28] 10SRE, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10IN) >>! 
In T287362#7236310, @Aklapper wrote: > @IN: Please stop adding totally unrelated comment... [14:03:38] (03CR) 10Elukey: "I left some comments but the code change as it is looks good to me. Before proceeding I'd still wait either Moritz's or John's +1 since we" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) (owner: 10Ottomata) [14:03:56] (03CR) 10Jbond: Revert "gerrit: daemon option in gerrit.config" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708102 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [14:08:58] (03CR) 10Hashar: Revert "gerrit: daemon option in gerrit.config" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708102 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [14:13:23] 10SRE, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Aklapper) @IN: Obviously yes if you do not get the very same error message as in this task. See... [14:14:12] elukey: re service userrs in puppet or data.yaml [14:14:41] i think if we keep them in puppet, the classes will be more useable outside of places that mighht not include the admin module [14:14:44] e.g. cloud vps [14:15:10] so, ya, am proposing to keep system users like analytics-search in data.yaml, since those are really for use by real people decalred in data.yaml [14:15:21] but service/daemon users like hadoop and yarn in puppet. [14:15:28] am fine withi addibng placeholder comments in data.yaml [14:15:30] what do you think? 
[14:16:31] PROBLEM - Disk space on stat1008 is CRITICAL: DISK CRITICAL - free space: / 2826 MB (3% inode=84%): /tmp 2826 MB (3% inode=84%): /var/tmp 2826 MB (3% inode=84%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops [14:17:26] (03PS1) 10Muehlenhoff: Remove account end date for Hal Triedman [puppet] - 10https://gerrit.wikimedia.org/r/708118 [14:17:29] oh oops wrong room [14:18:40] 10SRE, 10SRE-Access-Requests: Issues with server access, assistance requested - https://phabricator.wikimedia.org/T287245 (10RLazarus) 05Open→03Resolved All set! For posterity, the issue turned out to be twofold -- one, exactly as @Reedy guessed, SSH was one version too old to understand `ProxyJump` so we... [14:20:04] (03CR) 10Muehlenhoff: [C: 03+2] Remove account end date for Hal Triedman [puppet] - 10https://gerrit.wikimedia.org/r/708118 (owner: 10Muehlenhoff) [14:20:29] (03PS2) 10Muehlenhoff: Remove account end date for Hal Triedman [puppet] - 10https://gerrit.wikimedia.org/r/708118 [14:23:24] (03CR) 10RLazarus: [C: 03+2] icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [14:25:47] (03CR) 10RLazarus: "Argh, sorry to self-merge instead of waiting for Jenkins -- too many Puppet patches in a row, I just got used to hitting the button. 😖" [software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [14:33:18] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup The site is back online aft... 
[14:34:02] (03CR) 10Ottomata: [C: 03+1] Add stream configuration for ContentTranslation events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982) (owner: 10KartikMistry) [14:35:09] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Ladsgroup) {T287380} The first one. [14:36:00] (03CR) 10Elukey: "Looks good! Feel free to +2 +2 and submit" [labs/private] - 10https://gerrit.wikimedia.org/r/708115 (owner: 10Btullis) [14:37:23] RECOVERY - Disk space on stat1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops [14:37:36] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: use image version 2021-07-26-125114-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/708111 (https://phabricator.wikimedia.org/T264006) (owner: 10DCausse) [14:37:54] 10SRE, 10DynamicPageList (Wikimedia): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Ladsgroup) [14:38:06] (03CR) 10Ottomata: Deprecate profile::analytics::cluster::users (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) (owner: 10Ottomata) [14:38:57] 10SRE, 10DynamicPageList (Wikimedia), 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Aklapper) [14:40:00] (03CR) 10Ottomata: Deprecate profile::analytics::cluster::users (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) (owner: 10Ottomata) [14:40:18] 10SRE, 10Infrastructure-Foundations, 10netops: cr2-codfw:fpc0 crash - https://phabricator.wikimedia.org/T287110 (10Papaul) FPC0 S/N updated in Netbox 
[14:40:39] (03Merged) 10jenkins-bot: rdf-streaming-updater: use image version 2021-07-26-125114-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/708111 (https://phabricator.wikimedia.org/T264006) (owner: 10DCausse) [14:41:15] PROBLEM - Host mw2336 is DOWN: PING CRITICAL - Packet loss = 100% [14:42:49] 10SRE, 10DynamicPageList (Wikimedia), 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Jdforrester-WMF) IIRC, Wikimedia's DPL fork was created as part of the Wikivoyage migration rush because a few of the incoming communities insisted they neede... [14:42:58] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [14:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:57] PROBLEM - Host mw2336.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:44:24] 10SRE, 10Infrastructure-Foundations, 10netops: cr2-codfw:fpc0 crash - https://phabricator.wikimedia.org/T287110 (10Papaul) Shipped out faulty line card today. Tracking information below {F34564155} [14:51:43] (03CR) 10Ladsgroup: [C: 03+2] Don’t generate current content text twice [extensions/AbuseFilter] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/707021 (owner: 10Lucas Werkmeister (WMDE)) [14:52:05] deploying it now ^ [14:52:11] Lucas_WMDE: cc [14:52:19] ack [14:52:24] now we wait for twenty minutes :D [14:56:23] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/708100 (owner: 10Jbond) [14:57:02] 10SRE, 10DynamicPageList (Wikimedia), 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10IEPCBM) Can DPL be somehow optimized so as not to load the DB? [14:58:02] 10SRE, 10DynamicPageList (Wikimedia), 10PoolCounter, 10serviceops, and 8 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10Urbanecm) >>! 
In T263220#7236031, @mark wrote: > Given that the underlying problem that this change might help with has already caused multiple fu... [14:58:16] (03PS1) 10Urbanecm: Revert "Revert "Add PoolCounter settings for DPL"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708051 (https://phabricator.wikimedia.org/T263220) [14:58:31] (03PS2) 10Urbanecm: Revert "Revert "Add PoolCounter settings for DPL"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708051 (https://phabricator.wikimedia.org/T263220) [14:59:53] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add an alluxio keytab to an-test-presto1001.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/708115 (owner: 10Btullis) [15:00:03] (03CR) 10Bstorm: cloud dns: tidy up the labs-ip-alias-dump script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/707478 (https://phabricator.wikimedia.org/T285537) (owner: 10Bstorm) [15:01:29] (03CR) 10Bstorm: cloud dns: tidy up the labs-ip-alias-dump script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/707478 (https://phabricator.wikimedia.org/T285537) (owner: 10Bstorm) [15:01:31] (03CR) 10Jbond: [C: 03+2] Revert "gerrit: daemon option in gerrit.config" [puppet] - 10https://gerrit.wikimedia.org/r/708102 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [15:02:07] hashar: fyi merged ^^ sorry for the delay i had another meeting straight after [15:02:18] no worries ;) [15:02:24] btullis: fyi i merged your change i the private repo [15:03:11] jbond: and you gotta press [Submit] + puppet-merge it ;] [15:04:30] (03PS1) 10Muehlenhoff: Move systemd presets to /run [puppet] - 10https://gerrit.wikimedia.org/r/708121 [15:05:00] (03CR) 10KartikMistry: "Scheduled for deployment: https://wikitech.wikimedia.org/wiki/Deployments#Tuesday%2C_July_27" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982) (owner: 10KartikMistry) [15:06:48] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/708121 (owner: 10Muehlenhoff) [15:08:00] (03CR) 10David Caro: [C: 03+1] cloud dns: tidy up the labs-ip-alias-dump script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/707478 (https://phabricator.wikimedia.org/T285537) (owner: 10Bstorm) [15:08:47] (03CR) 10Muehlenhoff: haproxy: read config directory natively (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708108 (owner: 10Filippo Giunchedi) [15:11:00] (03Merged) 10jenkins-bot: Don’t generate current content text twice [extensions/AbuseFilter] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/707021 (owner: 10Lucas Werkmeister (WMDE)) [15:11:59] (03CR) 10Muehlenhoff: haproxy: read config directory natively (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708108 (owner: 10Filippo Giunchedi) [15:13:37] (03PS2) 10Bstorm: cloud dns: tidy up the labs-ip-alias-dump script [puppet] - 10https://gerrit.wikimedia.org/r/707478 (https://phabricator.wikimedia.org/T285537) [15:14:53] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:17:28] (03CR) 10Ryan Kemper: [C: 03+2] Revert "thanos-swift envoy listener: rewrite HTTP host header" [puppet] - 10https://gerrit.wikimedia.org/r/705480 (owner: 10DCausse) [15:19:59] !log Adding peering to AS139931 - Bangladesh Submarine Cable Company - at Equinix Singapore on cr3-eqsin [15:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:57] I get ssh: connect to host mw2336.codfw.wmnet port 22: Connection timed out in scap [15:21:59] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.15/extensions/AbuseFilter/includes/VariableGenerator/RunVariableGenerator.php: Backport: [[gerrit:707021|Don’t generate current content text twice]], Part I (duration: 01m 50s) [15:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:16] 10SRE, 
10DynamicPageList (Wikimedia), 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Reedy) >>! In T287380#7236461, @Jdforrester-WMF wrote: > IIRC, Wikimedia's DPL fork was created as part of the Wikivoyage migration rush because a few of the... [15:24:30] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.15/extensions/AbuseFilter/includes/AbuseFilterHooks.php: Backport: [[gerrit:707021|Don’t generate current content text twice]], Part II (duration: 01m 49s) [15:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:22] 10SRE, 10Infrastructure-Foundations, 10SRE-tools, 10serviceops: Documentation updates in decom workflow - https://phabricator.wikimedia.org/T287388 (10RLazarus) p:05Triage→03Low [15:25:46] (03PS1) 10Hashar: gerrit: disabled patchset level comments [puppet] - 10https://gerrit.wikimedia.org/r/708124 (https://phabricator.wikimedia.org/T287385) [15:26:41] (03PS1) 10Andrew Bogott: Added wmcs-pause-cloud admin script [puppet] - 10https://gerrit.wikimedia.org/r/708125 [15:26:44] 10SRE, 10DynamicPageList (Wikimedia), 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Jdforrester-WMF) >>! In T287380#7236598, @Reedy wrote: >>>! In T287380#7236461, @Jdforrester-WMF wrote: >> IIRC, Wikimedia's DPL fork was created as part of t... [15:27:09] (03CR) 10jerkins-bot: [V: 04-1] Added wmcs-pause-cloud admin script [puppet] - 10https://gerrit.wikimedia.org/r/708125 (owner: 10Andrew Bogott) [15:28:47] 10SRE, 10DynamicPageList (Wikimedia), 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Legoktm) I haven't fully reviewed the incident yet, but my understanding is that our DPL fork isn't as bad as the others, but it still has some issues. Most w... 
[15:29:18] !log Restarted gerrit replica on gerrit2001.wikimedia.org # T287122 [15:29:18] (03PS2) 10Andrew Bogott: Added wmcs-pause-cloud admin script [puppet] - 10https://gerrit.wikimedia.org/r/708125 [15:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:25] T287122: resolve gerrit.config disprepancy between managed config and gerrit init - https://phabricator.wikimedia.org/T287122 [15:29:55] (03CR) 10jerkins-bot: [V: 04-1] Added wmcs-pause-cloud admin script [puppet] - 10https://gerrit.wikimedia.org/r/708125 (owner: 10Andrew Bogott) [15:31:45] jbond: that worked. thank you! [15:32:41] np [15:34:48] (03PS3) 10Andrew Bogott: Added wmcs-pause-cloud admin script [puppet] - 10https://gerrit.wikimedia.org/r/708125 [15:35:15] 10SRE, 10Services, 10Toolhub, 10serviceops, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10wkandek) [15:37:55] Amir1: it went down an hour ago [15:39:21] 10SRE, 10DynamicPageList (Wikimedia), 10PoolCounter, 10serviceops, and 8 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10Legoktm) @Urbanecm I'm on clinic duty this week and just so happen to have PoolCounter experience so let's find a time to pair on this. [15:41:17] 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10SRE Observability (FY2021/2022-Q1): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10colewhite) [15:41:21] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/708105 (owner: 10Filippo Giunchedi) [15:41:23] legoktm: can you look at why mw2336 is dead [15:41:46] o.O yeah [15:46:43] (03PS1) 10Urbanecm: toolforge: Install arc in exec environ [puppet] - 10https://gerrit.wikimedia.org/r/708129 (https://phabricator.wikimedia.org/T287390) [15:46:43] it was in ejegg's git dirs. 
[15:46:55] let me see if i captured them to a file [15:47:23] Jul 15 04:31:41 frdev1001 malware_detector[22368]: Scanning /home/ejegg/payments/.git/objects/pack/pack-4facf6f3de6bc5192df73c919b20ccfca16e5fe6.pack [15:47:30] Jul 15 05:19:38 frpm1001 malware_detector[12572]: Scanning /srv/www/org/wikimedia/payments/.git/objects/pack/pack-43736e2bf94de4c92b686cb9a46c0da30cdb6ae6.pack [15:47:30] (03CR) 10Dzahn: [C: 03+2] add chart for miscweb [deployment-charts] - 10https://gerrit.wikimedia.org/r/698895 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [15:47:35] Jul 15 06:17:06 frnetmon1001 malware_detector[25243]: Scanning /usr/share/nmap/nmap-service-probes [15:47:50] gah. sorry. wrong channel [15:47:54] sorry about that. [15:48:53] (03PS4) 10Andrew Bogott: Added wmcs-pause-cloud admin script [puppet] - 10https://gerrit.wikimedia.org/r/708125 [15:49:37] (03CR) 10jerkins-bot: [V: 04-1] Added wmcs-pause-cloud admin script [puppet] - 10https://gerrit.wikimedia.org/r/708125 (owner: 10Andrew Bogott) [15:50:19] (03Merged) 10jenkins-bot: add chart for miscweb [deployment-charts] - 10https://gerrit.wikimedia.org/r/698895 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [15:51:20] 10SRE, 10Analytics-Clusters, 10Infrastructure-Foundations, 10netops: Automate ingestion of netflow event stream - https://phabricator.wikimedia.org/T248865 (10Ottomata) [15:52:44] (03PS5) 10Andrew Bogott: Added wmcs-pause-cloud admin script [puppet] - 10https://gerrit.wikimedia.org/r/708125 [15:55:14] (03CR) 10Dzahn: [C: 03+2] gerrit: add boiler plate spec for gerrit::jetty [puppet] - 10https://gerrit.wikimedia.org/r/708088 (https://phabricator.wikimedia.org/T287360) (owner: 10Hashar) [15:56:24] (03PS2) 10Jbond: Move systemd presets to /run [puppet] - 10https://gerrit.wikimedia.org/r/708121 (owner: 10Muehlenhoff) [15:56:26] (03PS1) 10Jbond: wmflib::dir::mkdir_p: Use ensure_resource for full directory [puppet] - 10https://gerrit.wikimedia.org/r/708130 [15:58:40] (03CR) 10Dzahn: 
"The comments above these options explain why these have been kept on purpose in the past. So I think either that was a good reason to keep" [puppet] - 10https://gerrit.wikimedia.org/r/708103 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [15:58:52] (03CR) 10jerkins-bot: [V: 04-1] wmflib::dir::mkdir_p: Use ensure_resource for full directory [puppet] - 10https://gerrit.wikimedia.org/r/708130 (owner: 10Jbond) [16:04:47] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw2336.codfw.wmnet [16:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:18] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10MoritzMuehlenhoff) [16:05:34] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30344/console" [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [16:06:02] (03PS2) 10Jbond: wmflib::dir::mkdir_p: Use ensure_resource for full directory [puppet] - 10https://gerrit.wikimedia.org/r/708130 [16:06:25] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10MoritzMuehlenhoff) [16:06:39] 10ops-codfw, 10DC-Ops: hw troubleshooting: mw2336.codfw.wmnet and it's mgmt are down - https://phabricator.wikimedia.org/T287394 (10Legoktm) [16:06:58] !log depooled mw2336.codfw.mwnet, mgmt is down too. 
T287394 [16:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:05] T287394: hw troubleshooting: mw2336.codfw.wmnet and it's mgmt are down - https://phabricator.wikimedia.org/T287394 [16:07:06] (03PS3) 10Jbond: wmflib::dir::mkdir_p: Use ensure_resource for full directory [puppet] - 10https://gerrit.wikimedia.org/r/708130 [16:07:28] Amir1, RhinosF1: https://phabricator.wikimedia.org/T287394 [16:08:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30346/console" [puppet] - 10https://gerrit.wikimedia.org/r/708130 (owner: 10Jbond) [16:08:34] 10ops-codfw, 10DC-Ops: hw troubleshooting: mw2336.codfw.wmnet and its mgmt are down - https://phabricator.wikimedia.org/T287394 (10Ladsgroup) [16:09:11] ACKNOWLEDGEMENT - SSH on mw2336 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Legoktm T287394 https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:09:11] ACKNOWLEDGEMENT - PHP7 rendering on mw2336 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Legoktm T287394 https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:09:11] ACKNOWLEDGEMENT - Memcached on mw2336 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Legoktm T287394 https://wikitech.wikimedia.org/wiki/Memcached [16:09:12] ACKNOWLEDGEMENT - Apache HTTP on mw2336 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Legoktm T287394 https://wikitech.wikimedia.org/wiki/Application_servers [16:09:12] ACKNOWLEDGEMENT - Host mw2336 is DOWN: PING CRITICAL - Packet loss = 100% Legoktm T287394 [16:09:26] ACKNOWLEDGEMENT - SSH on mw2336.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds Legoktm T287394 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:09:26] ACKNOWLEDGEMENT - Host mw2336.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Legoktm T287394 [16:11:48] (03PS3) 10Jbond: Move systemd presets to /run [puppet] - 
10https://gerrit.wikimedia.org/r/708121 (owner: 10Muehlenhoff) [16:14:32] (03PS4) 10Jbond: Move systemd presets to /run [puppet] - 10https://gerrit.wikimedia.org/r/708121 (owner: 10Muehlenhoff) [16:14:55] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10Bstorm) [16:16:05] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:17:10] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10Bstorm) Cloud team has decided we have too much in this row, and since breakage is possible if we freeze the cloud intentionally, we are going... [16:18:06] (03PS5) 10Jbond: Move systemd presets to /run [puppet] - 10https://gerrit.wikimedia.org/r/708121 (owner: 10Muehlenhoff) [16:18:35] (03CR) 10Jbond: [C: 03+1] "LGTM (see previous PS)" [puppet] - 10https://gerrit.wikimedia.org/r/708121 (owner: 10Muehlenhoff) [16:18:47] 10SRE, 10Data-Persistence-Backup, 10bacula: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (10jcrespo) a:03jcrespo [16:39:11] (03CR) 10Jbond: "LGTM, see commented inline (also no tests :()" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/708095 (https://phabricator.wikimedia.org/T285569) (owner: 10Cathal Mooney) [16:56:35] RECOVERY - Host mw2336 is UP: PING OK - Packet loss = 0%, RTA = 33.33 ms [16:58:09] RECOVERY - Host mw2336.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.65 ms [17:00:04] ryankemper: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikidata Query Service weekly deploy . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210726T1700). 
[17:07:00] (03CR) 10Hashar: [C: 04-1] "I have to check whether there was a good reason to include every single java options but I could not find any. gerrit init only injects:" [puppet] - 10https://gerrit.wikimedia.org/r/708104 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [17:11:21] 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Christina Macholan - https://phabricator.wikimedia.org/T287233 (10Legoktm) a:03Legoktm [17:12:51] (03PS1) 10Legoktm: admin: Add cmacholan to ldap_only_users for "wmf" group access [puppet] - 10https://gerrit.wikimedia.org/r/708145 (https://phabricator.wikimedia.org/T287233) [17:16:37] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:18:03] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Reimage dbprov1002 to buster [puppet] - 10https://gerrit.wikimedia.org/r/707243 (https://phabricator.wikimedia.org/T287230) (owner: 10Jcrespo) [17:19:05] (03CR) 10Hashar: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/708103 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [17:20:50] (03PS2) 10Hashar: gerrit: remove unused settings from [container] [puppet] - 10https://gerrit.wikimedia.org/r/708103 (https://phabricator.wikimedia.org/T287122) [17:23:36] (03PS2) 10Hashar: gerrit: remove unused container.javaOptions values [puppet] - 10https://gerrit.wikimedia.org/r/708104 (https://phabricator.wikimedia.org/T287122) [17:23:50] 10SRE, 10Data-Persistence-Backup, 10bacula: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (10jcrespo) p:05Triage→03High [17:30:10] (03CR) 10Hashar: [C: 03+1] "I suspected we might require all JVM options we use to be listed in case "gerrit init" wrote any options the JVM might have been given. 
But" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/708104 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [17:31:04] (03CR) 10Hashar: "It is a noop for our current Gerrit 3.2 but will be recognized by Gerrit 3.3." [puppet] - 10https://gerrit.wikimedia.org/r/708124 (https://phabricator.wikimedia.org/T287385) (owner: 10Hashar) [17:31:45] fyi the mwmaint1002 fingerprints changed because of reimaging, https://wikitech.wikimedia.org/w/index.php?title=Help%3ASSH_Fingerprints%2Fmwmaint1002.eqiad.wmnet&type=revision&diff=1919694&oldid=1804776 [17:31:51] (not an issue until we switchback) [17:37:43] 10SRE, 10ops-codfw, 10DC-Ops: hw troubleshooting: mw2336.codfw.wmnet and its mgmt are down - https://phabricator.wikimedia.org/T287394 (10Papaul) 05Open→03Resolved Reset and upgrade IDRAC. Server is back up online. [17:38:09] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov1002.eqiad.wmnet with reason: REIMAGE [17:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:20] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov1002.eqiad.wmnet with reason: REIMAGE [17:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:25] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw2336.codfw.wmnet [17:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:35] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: ASAP) rack/setup/install ms-be20[62-65] - https://phabricator.wikimedia.org/T285809 (10Papaul) [17:41:40] 10SRE, 10ops-codfw, 10DC-Ops: hw troubleshooting: mw2336.codfw.wmnet and its mgmt are down - https://phabricator.wikimedia.org/T287394 (10Legoktm) Thank you! 
[17:41:59] !log ran `scap pull` and repooled mw2336.codfw.wmnet - T287394 [17:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:05] T287394: hw troubleshooting: mw2336.codfw.wmnet and its mgmt are down - https://phabricator.wikimedia.org/T287394 [17:43:35] (03CR) 10Andrew Bogott: Added wmcs-pause-cloud admin script (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/708125 (owner: 10Andrew Bogott) [17:44:13] legoktm: I don't see any customers for morning B&C (in ~15 mins), do you want to try to push https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/708051 out (the DPL poolcounter config)? [17:44:21] sure [17:45:19] excellent, thanks. [17:45:37] do we have a DPL test page on a wiki somewhere? [17:45:52] i definitely played with it at ruwikinews [17:45:55] let me find something [17:46:01] ...without re-enabling on ruwikinews :p [17:46:11] oh [17:46:26] i have some code at https://ru.wikinews.org/wiki/%D0%A3%D1%87%D0%B0%D1%81%D1%82%D0%BD%D0%B8%D0%BA:Martin_Urbanec/sand [17:46:36] should be easy enough to bring to some other DPL-enabled wiki [17:47:28] "In the event DPL is causing DB problems, decrease to 2." [17:47:33] We can enable it temporarily on testwiki too [17:47:44] does that mean timeout or workers/maxqueue? [17:47:46] or even at ruwikinews, mwdebug only [17:48:32] legoktm[m]: does https://en.wikinews.org/wiki/Chili_Finger_Incident#Recently_Edited_headlines work as an example? [17:48:33] jynus: workers/maxqueue I believe [17:48:57] (03PS13) 10Ottomata: Use admin module to manage system user for use by human users [puppet] - 10https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) [17:49:11] Probably! 
[17:49:20] lol, you seem as certain as I do [17:49:50] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-08-31) rack/setup/install ganeti-test200[123] - https://phabricator.wikimedia.org/T286484 (10Papaul) [17:49:53] if you find out for sure, I will send you a patch for clearer wording of comment :-) [17:49:54] And lol, that happened where I live. I remember when that happened, it was insane. [17:50:09] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Krassotkin) >>! In T287362#7236399, @Ladsgroup wrote: > The site is back onl... [17:50:16] I hate to think in the middle of an outage [17:51:20] jynus: maybe ping bawollf in the task? He originally wrote that patch. [17:51:55] yeah, not asking you to do anything, sorry, it was more of a thinking aloud [17:54:14] (03PS3) 10Jcrespo: dbbackups: Reorganize backups after dbprov1002 reimage [puppet] - 10https://gerrit.wikimedia.org/r/707250 (https://phabricator.wikimedia.org/T287230) [17:54:24] I'm thinking too, about how to clarify it :D [17:54:33] I can take care of that [17:54:40] thanks [17:54:46] I just am unsure which of the 2 is [17:54:57] I will ask him on ticket [17:55:00] as you suggest [17:58:12] 10SRE, 10DynamicPageList (Wikimedia), 10PoolCounter, 10serviceops, and 8 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10jcrespo) @Bawolff I think you wrote the comment at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/645994/2/wmf-config/PoolCounterS... [18:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210726T1800). 
[18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:00:15] I'm here [18:00:32] legoktm: ok for me to start with the DPL patch? [18:00:36] yep [18:01:35] (03CR) 10Urbanecm: [C: 03+2] Revert "Revert "Add PoolCounter settings for DPL"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708051 (https://phabricator.wikimedia.org/T263220) (owner: 10Urbanecm) [18:02:18] (03Merged) 10jenkins-bot: Revert "Revert "Add PoolCounter settings for DPL"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708051 (https://phabricator.wikimedia.org/T263220) (owner: 10Urbanecm) [18:02:57] legoktm: patch is at mwdebug2001, if you want to test it there [18:03:07] (03PS6) 10Andrew Bogott: Added wmcs-pause-cloud admin script [puppet] - 10https://gerrit.wikimedia.org/r/708125 [18:03:09] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Whatamidoing-WMF) Background information from September 2020, when mass uplo... [18:03:15] where did the poolcounter errors show up last time, in logstash? 
[18:03:20] yes [18:04:03] (i actually looked in poolcounter.log at mwlog, but i think it's also in logstash) [18:04:10] [53f8f9cc-81c4-430b-920e-df58f713cf81] 2021-07-26 18:04:01: Fatal exception of type "Error" [18:04:17] tried purging the chili page [18:05:08] (03CR) 10Andrew Bogott: [C: 03+2] Added wmcs-pause-cloud admin script [puppet] - 10https://gerrit.wikimedia.org/r/708125 (owner: 10Andrew Bogott) [18:05:29] that's Error: Class 'PoolCounter_Client' [18:05:29] Error: Class 'PoolCounter_Client' not found [18:05:30] oof [18:05:34] it got namespaced since [18:05:44] i should use Client::class instead i guess [18:05:45] fixing [18:06:08] MediaWiki\Extension\PoolCounter\Client now [18:06:55] (03PS1) 10Urbanecm: DPL PoolCounter setting: Fix class name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708148 (https://phabricator.wikimedia.org/T263220) [18:07:12] (03CR) 10Andrew Bogott: [C: 03+1] "This is a step in the right direction!" [puppet] - 10https://gerrit.wikimedia.org/r/708042 (https://phabricator.wikimedia.org/T287269) (owner: 10Filippo Giunchedi) [18:07:13] legoktm: fix is at mwdebug now [18:08:07] uh, you sure? [18:08:14] "CirrusSearch-Automated"? [18:08:26] DPL is untouched [18:08:51] meh [18:11:05] (03PS2) 10Urbanecm: DPL PoolCounter setting: Fix class name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708148 (https://phabricator.wikimedia.org/T263220) [18:11:10] now i'm sure [18:11:27] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Reorganize backups after dbprov1002 reimage [puppet] - 10https://gerrit.wikimedia.org/r/707250 (https://phabricator.wikimedia.org/T287230) (owner: 10Jcrespo) [18:12:06] purging works legoktm [18:13:04] 2021-07-26 18:11:47 [ae3dc6ce-1847-4890-9076-e092c9ddc6f2] mwdebug2001 enwikinews 1.37.0-wmf.15 poolcounter INFO: Pool key 'nowait:dpl-query:enwikinews' (DPL): Error reading from pool counter server 10.192.0.132. 
[18:13:29] looks like same error like last time [18:13:45] that IP is poolcounter2003, so it's hitting the right server [18:14:46] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Aklapper) https://ru.wikinews.org is not "actually offline": You can verify... [18:16:14] I keep purging and I can't get it to show up again [18:17:07] try previewing it [18:17:14] i just did it like three four times, and i see one more in the logs [18:17:44] can you enable "verbose log" in XWMD too? [18:17:50] sure [18:18:12] I can't preview, it's full protected :| [18:18:25] oh [18:18:38] got it [18:18:44] 2021-07-26 18:18:00 [09ff3650-dbbc-46b5-865d-ac617e696e18] mwdebug2001 enwikinews 1.37.0-wmf.15 wfDebug DEBUG: Sending pool counter command: ACQ4ME nowait:dpl-query:enwikinews 25 25 0 [18:18:45] 2021-07-26 18:18:00 [09ff3650-dbbc-46b5-865d-ac617e696e18] mwdebug2001 enwikinews 1.37.0-wmf.15 poolcounter INFO: Pool key 'nowait:dpl-query:enwikinews' (DPL): Error reading from pool counter server 10.192.0.132. [18:18:55] so the debug log worked? [18:18:56] good [18:20:43] ACQ4ME nowait:dpl-query:enwikinews 25 25 0 [18:20:43] LOCKED [18:20:49] works fine when I try it directly with `nc` [18:24:44] it also happened before, although not as frequently as with this DPL [18:25:01] see ie. 
`2021-07-02 14:05:10 [25f8353b-b545-44f8-853a-2074226b06d8]`(from `archive/poolcounter.log-20210703.gz`) [18:25:06] (03PS1) 10Andrew Bogott: Install tmpreaper on quarry web hosts, clean up temp files 4+ days idle [puppet] - 10https://gerrit.wikimedia.org/r/708150 (https://phabricator.wikimedia.org/T238375) [18:30:18] let me merge the followup patch, too [18:30:27] (03CR) 10Urbanecm: [C: 03+2] DPL PoolCounter setting: Fix class name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708148 (https://phabricator.wikimedia.org/T263220) (owner: 10Urbanecm) [18:30:45] I'm struggling to replicate it out of MW [18:30:57] :( [18:31:09] (03Merged) 10jenkins-bot: DPL PoolCounter setting: Fix class name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708148 (https://phabricator.wikimedia.org/T263220) (owner: 10Urbanecm) [18:37:25] that error happens when fgets() returns false: https://github.com/wikimedia/mediawiki-extensions-PoolCounter/blob/53bd12944fe995daf60d6483f68aa0b4eefc3c85/includes/Client.php#L78 [18:37:46] https://www.php.net/manual/en/function.fgets.php says fgets() returns false when there's no data or "an error occurs" [18:38:19] can you try looking at the network traffic? does poolcounter return anything? [18:38:55] yeah I can tcpdump, lets see [18:39:24] urbanecm: can you try verbosely previewing again? [18:39:29] on it [18:40:14] I wonder if we're hitting the DPL query cache [18:40:23] which is why it doesn't trigger on every view [18:41:01] 10SRE, 10DynamicPageList (Wikimedia), 10PoolCounter, 10serviceops, and 8 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10Bawolff) >>! In T263220#7237177, @jcrespo wrote: > @Bawolff I think you wrote the comment at https://gerrit.wikimedia.org/r/c/operations/mediawiki... [18:41:26] legoktm: do you want me to set wgDLPQueryCacheTime to zero at mwdebug? [18:41:42] that should disable cache [18:41:43] yeah... [18:42:21] did you notice that it's $wg*DLP*QueryCacheTime? sigh... 
[18:42:41] hehe [18:42:44] well, set to zero [18:42:49] and previewing [18:42:54] Dynamic Lage Pist [18:43:14] we can put it on the typos list along with WQDS [18:43:57] I'm not sure if that sounds like something that would cause more or less performance issues [18:44:53] i have literally zero idea what that error means, fwiw [18:44:55] entirely unrelatedly, the poolcounter errors for Special:Contributions/127.0.0.1 feel like another issue [18:45:28] urbanecm: which one? the "Error reading from pool counter server" one? [18:45:30] you know what's weird, I don't see where the lock is being released [18:45:43] debug logs DBQuery logs DPL SQL queries when i previeew [18:45:56] like, I see "ACQ4ME nowait:dpl-query:enwikinews 25 25 0" -> "LOCKED", but no follow-up "RELEASE" -> "RELEASED" [18:46:22] a few earlier I see RELEASEs for [18:47:02] (03PS1) 10Clare Ming: Enable user links on office + test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) [18:47:23] `2021-07-26 18:46:17 [d9b5983b-9c84-480e-97d9-41598d5333b0] mwdebug2001 enwikinews 1.37.0-wmf.15 error-json DEBUG: {"id":"d9b5983b-9c84-480e-97d9-41598d5333b0","type":"ErrorException","file":"/srv/mediawiki/php-1.37.0-wmf.15/extensions/PoolCounter/includes/ConnectionManager.php","line":97,"message":"PHP Warning: fsockopen(): unable to connect to 10.192.0.132:7531 (Connection timed out)"` -- related? [18:47:29] (from XWikimediaDebug.log) [18:48:05] very possible [18:48:15] (03CR) 10jerkins-bot: [V: 04-1] Enable user links on office + test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) (owner: 10Clare Ming) [18:48:27] hmm [18:50:25] so far, no error in poolcounter.log though [18:50:35] should i keep previewing? 
[18:50:58] (no related, i mean) [18:51:08] probably not [18:51:10] (03PS2) 10Clare Ming: Enable user links on office + test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) [18:51:11] the connection timed out is a warning, which means it should keep executing, right? [18:52:05] but if that's the case, based on my understanding it should fail with "poolcounter-write-error", not with "poolcounter-read-error" that it currently fails with [18:52:12] I think the fsockopen() error is unrelated, because if were weren't able to connect then fwrite.. yeah [18:52:23] (03CR) 10jerkins-bot: [V: 04-1] Enable user links on office + test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) (owner: 10Clare Ming) [18:52:38] also that only ever showed up once [18:52:43] ah [18:53:08] I don't have any other context than what you're posting here, so couldn't tell that :/ [18:53:17] no errors/warnings from fgets or fwrite either [18:53:57] https://www.php.net/manual/en/function.fgets.php says that it will return false if "an error occurs" [18:54:46] yeah, I don't see a way to get that underlying error though [18:54:58] we're almost at the end of the window [18:55:07] PROBLEM - Thanos query has high gRPC client errors on alert1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [18:55:09] legoktm: maybe deploy and tcpdump using live traffic? 
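The open question above is that a failed read collapses into a single "false" with no underlying reason. As a rough sketch only (this is not the PoolCounter extension's code; `read_reply` and the status strings are hypothetical names made up for illustration), a line-oriented client could report whether the peer closed the connection or simply never answered in time. The only protocol detail taken from the log is the `LOCKED` reply to the `ACQ4ME` command.

```python
import socket


def read_reply(sock, timeout=0.5):
    """Read one newline-terminated reply and say why the read failed, if it did.

    Returns (status, text) where status is "ok", "closed" (peer closed the
    connection before a full line arrived) or "timeout" (nothing arrived in time).
    """
    sock.settimeout(timeout)
    buf = b""
    try:
        while not buf.endswith(b"\n"):
            chunk = sock.recv(1024)
            if not chunk:  # an empty read means the peer closed the socket
                return ("closed", buf.decode())
            buf += chunk
    except socket.timeout:
        return ("timeout", buf.decode())
    return ("ok", buf.decode().strip())


if __name__ == "__main__":
    # A local socket pair stands in for the real server in this demo.
    client, server = socket.socketpair()
    client.sendall(b"ACQ4ME nowait:dpl-query:enwikinews 25 25 0\n")
    server.recv(1024)            # the fake server consumes the command
    server.sendall(b"LOCKED\n")  # and answers as seen in the log above
    print(read_reply(client))
```

Keeping "closed" apart from "timeout" would at least narrow down whether the server dropped the connection or the client gave up waiting, which is the distinction the debug log could not provide.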
[18:56:13] (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/708153 [18:57:07] I'm not that confident [18:57:22] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1003/30348/" [puppet] - 10https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) (owner: 10Ottomata) [18:57:27] just thinking out loud :) [18:57:51] the part that I'm really not sure about is that this is happening inside another poolcounter lock, usually [18:58:15] like parsing a page holds a lock, and inside of that we're trying to grab another one [18:58:32] but in my limited testing it appeared to work fine [19:00:25] PROBLEM - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad+prometheus/ops [19:00:32] no logs on poolcounter2003 either [19:01:31] well we're out of time [19:01:46] so revert? [19:01:50] yeah, I think so [19:01:53] okay, doing [19:02:03] I see some logs where it appears to work properly, and some where it doesn't [19:02:05] PROBLEM - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw+prometheus/global [19:02:25] e.g. 2591e9fb-cebe-48a1-a922-8a345a9b0b0d looks correct [19:03:17] RECOVERY - Thanos query has high gRPC client errors on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [19:04:22] (03PS1) 10Urbanecm: Revert DPL poolcounter setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708154 (https://phabricator.wikimedia.org/T263220) [19:04:38] (03CR) 10Urbanecm: [C: 03+2] Revert DPL poolcounter setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708154 (https://phabricator.wikimedia.org/T263220) (owner: 10Urbanecm) [19:05:28] (03Merged) 10jenkins-bot: Revert DPL poolcounter setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708154 (https://phabricator.wikimedia.org/T263220) (owner: 10Urbanecm) [19:05:56] revert merged, pulled to deployment [19:06:15] thanks [19:06:20] legoktm: thanks for your help during this hour :) [19:06:22] I'll write up some notes on the ticket in a bit [19:06:32] that'd be greatly appreciated [19:07:50] tbh I'm mostly interested in that this seems to be an issue with poolcounter. I'm not really convinced spending more time on DPL is worth it vs creating a bot/replacement [19:10:29] (03PS1) 10Jdlrobson: Enable WVUI search on Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708156 (https://phabricator.wikimedia.org/T287215) [19:12:19] (03PS3) 10Clare Ming: Enable user links on office + test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) [19:13:36] (03CR) 10jerkins-bot: [V: 04-1] Enable user links on office + test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) (owner: 10Clare Ming) [19:16:57] RECOVERY - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw+prometheus/global [19:21:23] (03PS1) 10Jdlrobson: Disable mobile contributions simplifications on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708158 (https://phabricator.wikimedia.org/T283988) [19:21:47] (03PS5) 10Cwhite: logstash: add gitlab ECS transformations [puppet] - 10https://gerrit.wikimedia.org/r/705019 (https://phabricator.wikimedia.org/T274462) [19:21:54] (03PS1) 10Ottomata: Add system users and groups for for Airflow for Research and Platform Eng [puppet] - 10https://gerrit.wikimedia.org/r/708159 (https://phabricator.wikimedia.org/T284225) [19:22:26] (03CR) 10Jdlrobson: [C: 04-1] Enable user links on office + test wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) (owner: 10Clare Ming) [19:22:38] (03CR) 10jerkins-bot: [V: 04-1] Add system users and groups for for Airflow for Research and Platform Eng [puppet] - 10https://gerrit.wikimedia.org/r/708159 (https://phabricator.wikimedia.org/T284225) (owner: 10Ottomata) [19:22:40] (03CR) 10jerkins-bot: [V: 04-1] Disable mobile contributions simplifications on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708158 (https://phabricator.wikimedia.org/T283988) (owner: 10Jdlrobson) [19:23:00] (03PS2) 10Jdlrobson: Enable WVUI search on Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708156 (https://phabricator.wikimedia.org/T287215) [19:23:04] (03PS2) 10Jdlrobson: Disable mobile contributions simplifications on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708158 (https://phabricator.wikimedia.org/T283988) [19:23:29] RECOVERY - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad+prometheus/ops [19:23:41] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30349/console" [puppet] - 10https://gerrit.wikimedia.org/r/708159 (https://phabricator.wikimedia.org/T284225) (owner: 10Ottomata) [19:24:12] (03CR) 10Jdlrobson: [C: 04-1] "Blocked on 1.37.0-wmf.16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708158 (https://phabricator.wikimedia.org/T283988) (owner: 10Jdlrobson) [19:24:17] (03CR) 10jerkins-bot: [V: 04-1] Disable mobile contributions simplifications on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708158 (https://phabricator.wikimedia.org/T283988) (owner: 10Jdlrobson) [19:26:30] (03CR) 10Cwhite: [C: 03+2] logstash: add gitlab ECS transformations [puppet] - 10https://gerrit.wikimedia.org/r/705019 (https://phabricator.wikimedia.org/T274462) (owner: 10Cwhite) [19:30:38] 10SRE, 10DynamicPageList (Wikimedia), 10PoolCounter, 10serviceops, and 8 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10Legoktm) @Urbanecm and I (plus lurker @majavah :)) spent an hour today trying to roll this out with mediocre success, but not enough confidence fo... [19:43:58] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Bawolff) ruwikinews now has 13M pages (bigger than enwiktionary). Even if th... 
[19:44:25] (03PS1) 10Cwhite: rsyslog: add gitlab input-file entries to lookup table [puppet] - 10https://gerrit.wikimedia.org/r/708160 (https://phabricator.wikimedia.org/T274462) [19:45:07] (03PS2) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/708153 [19:46:07] (03PS3) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/708153 [19:50:29] (03PS2) 10Ottomata: Add system users and groups for for Airflow for Research and Platform Eng [puppet] - 10https://gerrit.wikimedia.org/r/708159 (https://phabricator.wikimedia.org/T284225) [19:51:35] (03PS4) 10Clare Ming: Enable user links on office + test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) [19:52:44] (03CR) 10jerkins-bot: [V: 04-1] Enable user links on office + test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) (owner: 10Clare Ming) [19:58:17] (03PS5) 10Jdlrobson: Enable user links on office + test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) (owner: 10Clare Ming) [19:59:23] (03CR) 10jerkins-bot: [V: 04-1] Enable user links on office + test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) (owner: 10Clare Ming) [20:00:05] chrisalbon and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210726T2000). [20:00:17] 10SRE, 10DynamicPageList (Wikimedia), 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Bawolff) >>! 
In T287380#7236461, @Jdforrester-WMF wrote: > IIRC, Wikimedia's DPL fork was created as part of the Wikivoyage migration rush because a few of th... [20:00:56] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10DonSimon) >>! In T287362#7237477, @Bawolff wrote: > ruwikinews now has 13M p... [20:02:29] (03CR) 10Ahmon Dancy: [C: 03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/708153 (owner: 10Ahmon Dancy) [20:02:43] (03CR) 10Ahmon Dancy: [C: 03+2] "tested in train-dev" [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/708153 (owner: 10Ahmon Dancy) [20:03:37] (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/708153 (owner: 10Ahmon Dancy) [20:05:43] (03PS6) 10Clare Ming: Enable user links on office + test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) [20:06:54] (03CR) 10jerkins-bot: [V: 04-1] Enable user links on office + test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) (owner: 10Clare Ming) [20:13:31] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10stjn) >>! In T287362#7237477, @Bawolff wrote: > Also like dude, this is the... [20:14:49] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10IN) >>! 
In T287362#7236363, @Aklapper wrote: > @IN: Obviously yes if you do... [20:15:43] (03PS7) 10Clare Ming: Enable user links on office + test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) [20:16:58] (03CR) 10jerkins-bot: [V: 04-1] Enable user links on office + test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) (owner: 10Clare Ming) [20:21:26] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Krassotkin) @stjn This is about API. I did not observe such problem this time. [20:27:53] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Krassotkin) @Aklapper, this is not an exaggeration. I have attached screensh... [20:40:34] (03PS8) 10Clare Ming: Enable user links on office + test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) [20:41:37] (03CR) 10jerkins-bot: [V: 04-1] Enable user links on office + test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) (owner: 10Clare Ming) [20:44:15] 10SRE, 10DynamicPageList (Wikimedia), 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Ladsgroup) >>! In T287380#7236632, @Legoktm wrote: > I haven't fully reviewed the incident yet, but my understanding is that our DPL fork isn't as bad as the... [20:54:16] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. 
reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Ladsgroup) @Krassotkin you can be as loud as you like (until you get banned... [20:58:10] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: General site outage caused by ruwikinews usage of DPL: "upstream connect error or disconnect/reset before headers. reset reason: overflow" - https://phabricator.wikimedia.org/T287362 (10Peachey88) >>! In T287362#7237519, @DonSimon wrote: > ... Problems had start... [20:58:42] (03PS9) 10Jdlrobson: Enable user links on office + test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708152 (https://phabricator.wikimedia.org/T287391) (owner: 10Clare Ming) [21:00:05] Reedy and sbassett: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210726T2100). [21:18:55] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:20:45] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 3 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:17:09] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:18:29] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:21:41] (03PS3) 
10Brennen Bearnes: add gitlab2001 to host_vars and variables [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/707350 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [22:22:27] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] add gitlab2001 to host_vars and variables [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/707350 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [22:23:15] (03CR) 10Brennen Bearnes: [C: 03+1] move gitlab rails exporter to port 8083 [puppet] - 10https://gerrit.wikimedia.org/r/707859 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [22:33:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (10Jclark-ctr) dumpsdata1004 A2. u11. id#11002. port31 dumpsdata1005. C2 u12. id#11003 port35 [22:34:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (10Jclark-ctr) [22:34:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [22:48:53] 10SRE, 10Data-Persistence-Backup, 10bacula: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (10TJones) Any idea when someone might have time to look at this? I'm trying to avoid having to recreate code that I had on mwmaint1002, but I have another ticket that's... 
[22:53:15] (03CR) 10Legoktm: [C: 03+2] admin: Add cmacholan to ldap_only_users for "wmf" group access [puppet] - 10https://gerrit.wikimedia.org/r/708145 (https://phabricator.wikimedia.org/T287233) (owner: 10Legoktm) [22:56:10] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Christina Macholan - https://phabricator.wikimedia.org/T287233 (10Legoktm) 05Open→03Resolved @CMacholan you are now a member of the "wmf" LDAP group, https://ldap.toolforge.org/user/CMacholan - please comment here/reopen... [22:58:44] 10SRE, 10LDAP-Access-Requests: Access request to superset for user natalia-rodriguez - https://phabricator.wikimedia.org/T285436 (10Legoktm) [23:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening backport window. Your patch may or may not be deployed at the sole discretion of the deployer. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210726T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:03:23] 10SRE, 10LDAP-Access-Requests: Access request to superset for user natalia-rodriguez - https://phabricator.wikimedia.org/T285436 (10Legoktm) @NRodriguez we had a slight mixup, but I've updated the checklist and we're all set to add your access, I just wanted to confirm with you that you still need Superset acc... [23:04:01] oh, I'll sync out a patch then [23:04:16] (03CR) 10Legoktm: [C: 03+2] Increase lilypond version cache TTL to 1 hour [extensions/Score] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/707430 (owner: 10Legoktm) [23:07:17] 10SRE, 10Datacenter-Switchover: Add step to rsync home dirs on mwmaint hosts before DC switchover - https://phabricator.wikimedia.org/T287303 (10Legoktm) >>! In T287303#7235323, @Volans wrote: > I personally disagree, untracked and unreviewed scripts should not run against production IMHO. And we shouldn't enc... 
[23:08:38] 10SRE, 10DynamicPageList (Wikimedia), 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Legoktm) p:05Triage→03High [23:08:53] 10SRE, 10MediaWiki-extensions-Score, 10SRE-swift-storage: pages with lilypond code that are generated by score extension have no encoding info set by server - https://phabricator.wikimedia.org/T287326 (10Legoktm) p:05Triage→03Low [23:09:44] 10SRE, 10MediaWiki-extensions-Score, 10SRE-swift-storage: upload.wikimedia.org does not set content-encoding headers for Score-generated lilypond files - https://phabricator.wikimedia.org/T287326 (10Legoktm) [23:09:55] 10SRE, 10Datacenter-Switchover: Add step to rsync home dirs on mwmaint hosts before DC switchover - https://phabricator.wikimedia.org/T287303 (10Legoktm) p:05Triage→03Low [23:12:19] 10SRE, 10User-MoritzMuehlenhoff: Monitor sensitive sysctl settings - https://phabricator.wikimedia.org/T287081 (10Legoktm) p:05Triage→03Medium [23:14:43] 10SRE, 10Traffic, 10Sustainability (Incident Followup): LVS can't handle losing a NIC on eqiad and codfw - https://phabricator.wikimedia.org/T286924 (10Legoktm) p:05Triage→03Medium [ Setting priority as part of clinic duty, please retriage if incorrect ] [23:15:02] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Legoktm) p:05Triage→03High [23:15:23] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): lvs2007, lvs2009 and lvs2010 connected to the same row A switch - https://phabricator.wikimedia.org/T286879 (10Legoktm) p:05Triage→03High [23:15:46] 10SRE, 10SRE Observability (FY2021/2022-Q2): node_cpu_frequency_hertz metric no longer present in Bullseye - https://phabricator.wikimedia.org/T286768 (10Legoktm) p:05Triage→03Medium [23:16:12] 10SRE, 10Infrastructure-Foundations: Broadcom BCM57412 10G NIC and Bullseye installer - 
https://phabricator.wikimedia.org/T286722 (10Legoktm) p:05Triage→03Medium [23:16:50] 10SRE, 10Traffic, 10observability, 10Sustainability (Incident Followup): Per-country Frontend Traffic dashboards - https://phabricator.wikimedia.org/T286554 (10Legoktm) p:05Triage→03Low [23:17:17] 10SRE: urldownloader2002 running out of disk space in root partition - https://phabricator.wikimedia.org/T286525 (10Legoktm) p:05Triage→03Medium [23:17:37] 10SRE, 10Traffic, 10serviceops, 10Patch-For-Review, 10User-jijiki: Access mwdebug kubernetes deployment via the 'X-Wikimedia-Debug' header - https://phabricator.wikimedia.org/T286491 (10Legoktm) p:05Triage→03Medium [23:19:00] 10SRE, 10Traffic, 10serviceops, 10Patch-For-Review, 10User-jijiki: Access mwdebug kubernetes deployment via the 'X-Wikimedia-Debug' header - https://phabricator.wikimedia.org/T286491 (10Legoktm) @jijiki is {T286482} a duplicate of this one? To me it looks like both tasks have basically the same checklist [23:19:05] 10SRE, 10Traffic, 10WikimediaDebug, 10Performance-Team (Radar): Allow ATS to route traffic to mwdebug deployment on kubernetes - https://phabricator.wikimedia.org/T286482 (10Legoktm) p:05Triage→03Medium [23:21:39] 10SRE, 10Analytics: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (10Legoktm) Yes, https://gerrit.wikimedia.org/g/operations/puppet/+/4a3bf542618f4550dfbe450452ddc9e6294ed1d3/modules/profile/manifests/analytics/jupyterhub.pp#61 is the cron But I'm not sure migrating it... [23:23:17] 10SRE, 10Analytics: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (10Legoktm) Actually it's not sometimes, it's always missing. We've been getting this since the end of June at least, which is when I last cleaned out my root@ folder. 
[23:34:01] (03Merged) 10jenkins-bot: Increase lilypond version cache TTL to 1 hour [extensions/Score] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/707430 (owner: 10Legoktm) [23:34:23] 10SRE, 10Analytics: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (10Legoktm) >>! In T286442#7238156, @Legoktm wrote: > But I'm not sure migrating it to a timer fixes the underlying issue, which is that sometimes(?) `/srv/home` is missing. Nvm, it would. Even though the... [23:37:32] !log legoktm@deploy1002 Synchronized php-1.37.0-wmf.15/extensions/Score/includes/Score.php: Increase lilypond version cache TTL to 1 hour (duration: 00m 57s) [23:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:32] (03PS1) 10Legoktm: analytics: Migrate clean_jupyter_user_local_trash to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/708183 (https://phabricator.wikimedia.org/T286442) [23:44:35] (03PS1) 10Legoktm: analytics: Remove absented clean_jupyter_user_local_trash cron [puppet] - 10https://gerrit.wikimedia.org/r/708184 (https://phabricator.wikimedia.org/T273673) [23:45:44] (03CR) 10Legoktm: analytics: Migrate clean_jupyter_user_local_trash to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708183 (https://phabricator.wikimedia.org/T286442) (owner: 10Legoktm) [23:46:01] 10SRE, 10Wikimedia-Mailing-lists: Upgrade lists.wikimedia.org to next Mailman/hyperkitty/postorius versions - https://phabricator.wikimedia.org/T286217 (10Legoktm) p:05Triage→03Lowest [23:46:13] 10SRE, 10Analytics, 10Patch-For-Review: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (10Legoktm) p:05Triage→03Low [23:46:29] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: Mailman doesn't replace email in notice when changing subscription email - https://phabricator.wikimedia.org/T286149 (10Legoktm) p:05Triage→03Low [23:46:43] 10SRE, 10Wikimedia-Mailing-lists: CVN Mailing list 
acts like user isn't subscribed - https://phabricator.wikimedia.org/T286147 (10Legoktm) p:05Triage→03Medium [23:46:57] 10SRE, 10Wikimedia-Mailing-lists: Make auditing members of mailing lists bound to a user right easier - https://phabricator.wikimedia.org/T286122 (10Legoktm) p:05Triage→03Low [23:47:35] 10SRE, 10Wikimedia-Mailing-lists: Set up spare lists host in codfw, document failover procedure - https://phabricator.wikimedia.org/T286071 (10Legoktm) p:05Triage→03Low [23:49:47] 10SRE, 10Wikimedia-Mailing-lists: Set up spare lists host in codfw, document failover procedure - https://phabricator.wikimedia.org/T286071 (10Legoktm) I think a continuous rsync would be a bit overkill, but we could certainly start the rsync a day before we plan to failover. If not that, I think it's not the... [23:50:20] 10SRE, 10Wikimedia-Mailing-lists: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066 (10Legoktm) p:05Triage→03Low [23:50:47] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10Legoktm) p:05Triage→03Medium [23:51:41] 10SRE, 10Infrastructure-Foundations, 10netops, 10Datacenter-Switchover, 10User-fgiunchedi: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (10Legoktm) p:05Triage→03Medium [23:53:23] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10Legoktm) p:05Triage→03Medium [23:53:31] 10SRE, 10Datacenter-Switchover: switchdc check on mwmaint for running PHP processes should ignore php-fpm processes - https://phabricator.wikimedia.org/T285804 (10Legoktm) p:05Triage→03Low [23:55:30] (03CR) 10Jdlrobson: [C: 03+1] Enable user links on office + test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708152 
(https://phabricator.wikimedia.org/T287391) (owner: 10Clare Ming) [23:56:23] 10SRE, 10serviceops: php7.2-fpm_check_restart should be resilient to php7adm error pages - https://phabricator.wikimedia.org/T285593 (10Legoktm) p:05Triage→03Medium [23:56:48] 10SRE: Connecting to https://api.svc.codfw.wmnet/ does not work - https://phabricator.wikimedia.org/T285517 (10Legoktm) p:05Triage→03Medium [23:57:17] 10SRE, 10FR-MW-Vagrant, 10Fundraising-Backlog, 10MediaWiki-Vagrant: Package XDebug 2.9 for apt.wikimedia.org - https://phabricator.wikimedia.org/T220406 (10Legoktm) p:05Triage→03Low [23:57:35] 10SRE: bacula restore job waiting on higher jobs - https://phabricator.wikimedia.org/T95705 (10Legoktm) p:05Triage→03Low