SharePoint 2013 distributed cache bug

Update June 10, 2014: Sometime recently Microsoft announced they will support SharePoint 2013 farms where customers deploy an AppFabric update. Microsoft is releasing AppFabric and SharePoint updates independent of each other and expect customers to patch each system as needed. Check out KB 2843251: Installing Windows Server AppFabric updates SharePoint 2013 Servers for more information.

As well, since originally publishing this post I've built many more SharePoint 2013 farms (I'm not bragging, setting up farms is my job). What I've found is that in nearly all cases simply installing AppFabric CU4 and enabling the background garbage collection feature resolves the distributed cache issues. My original guidance stills stands: if you are experiencing the issue deploy the AppFabric CU and if you continue having issues only then consider messing with the distributed cache settings (and only if you know what you're doing).


While implementing a SharePoint 2013 farm we encountered an intermittent issue with distributed cache that we resolved with the help of Microsoft Support. This SharePoint 2013 farm allows customers to log in using a custom forms-based authentication provider. Occasionally, a user would click on link and instead of receiving the expected page they would unexpectedly be redirected to the sign in page where they were prompted to log in again.

Looking at the ULS logs we immediately saw there was an issue with Distributed Cache. The logs were full with dozens of exceptions every second, most (about 60 a second) were being suppressed.

Primarily we saw five exceptions:

Product: SharePoint Foundation
Category: DistributedCache
Level: Unexpected

Message:
Unexpected error occurred in method 'GetObject' , usage 'Distributed Logon Token Cache' - Exception 'Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:The request timed out..
Additional Information : The client was trying to communicate with the server : net.tcp://cacheserver.example.com:22233
 at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody, RequestBody reqBody)
 at Microsoft.ApplicationServer.Caching.DataCache.InternalGet(String key, DataCacheItemVersion& version, String region, IMonitoringListener listener)
 at Microsoft.ApplicationServer.Caching.DataCache.<>c__DisplayClass49.b__48()
 at Microsoft.SharePoint.DistributedCaching.SPDistributedCache.GetObject(String key)'.
Product: SharePoint Foundation
Category: DistributedCache
Level: Medium

Message:
Unexpected error occurred in method 'GetObject' , usage 'Distributed Logon Token Cache' - Exception 'Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.).
Additional Information : The client was trying to communicate with the server : net.tcp://cacheserver.example.com:22233
 at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody, RequestBody reqBody)
 at Microsoft.ApplicationServer.Caching.DataCache.InternalGet(String key, DataCacheItemVersion& version, String region, IMonitoringListener listener)
 at Microsoft.ApplicationServer.Caching.DataCache.<>c__DisplayClass49.b__48()
 at Microsoft.SharePoint.DistributedCaching.SPDistributedCache.GetObject(String key)'.
Product: SharePoint Foundation
Category: DistributedCache
Level: Medium

Message:
Token Cache: Failed to get token from distributed cache for '0#.f|provider|username'.(This is expected during the process warm up or if data cache Initialization is getting done by some other thread).
Exception: 'Microsoft.SharePoint.DistributedCaching.SPDistributedCacheClientRequestTimeOutException: Communications with the cache cluster has experienced a delay past the timeout value,please increase the RequestTimeout of the client. ---> Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:The request timed out..
Additional Information : The client was trying to communicate with the server : net.tcp://cacheserver.example.com:22233
 at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody, RequestBody reqBody)
 at Microsoft.ApplicationServer.Caching.DataCache.InternalGet(String key, DataCacheItemVersion& version, String region, IMonitoringListener listener)
 at Microsoft.ApplicationServer.Caching.DataCache.<>c__DisplayClass49.b__48()
 at Microsoft.SharePoint.DistributedCaching.SPDistributedCache.GetObject(String key)     -
 -- End of inner exception stack trace ---
 at Microsoft.SharePoint.DistributedCaching.SPDistributedCache.GetObject(String key)
 at Microsoft.SharePoint.IdentityModel.SPDistributedSecurityTokenCache.GetObject(String key)
 at Microsoft.SharePoint.IdentityModel.SPTokenCache.TryGetCachedToken(String cacheKey)'.
Product: SharePoint Server Search
Category: QueryCache
Level: Unexpected

Message:
SearchDistributedCache::PutAction() - Failed due to exception = 'Microsoft.Office.Server.DistributedCaching.SPDistributedCacheClusterDownException: Cache cluster is down, restart the cache cluster and Retry ---> Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.).
Additional Information : The client was trying to communicate with the server : net.tcp://cacheserver.example.com:22233
 at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody, RequestBody reqBody)
 at Microsoft.ApplicationServer.Caching.DataCache.InternalPut(String key, Object value, DataCacheItemVersion oldVersion, TimeSpan timeout, DataCacheTag[] tags, String region, IMonitoringListener listener)
 at Microsoft.ApplicationServer.Caching.DataCache.<>c__DisplayClass25.b__24()
 at Microsoft.ApplicationServer.Caching.DataCache.Put(String key, Object value, TimeSpan timeout)
 at Microsoft.Office.Server.DistributedCaching.SPDistributedCache.Put(String key, Object value)
 --- End of inner exception stack trace ---
 at Microsoft.Office.Server.DistributedCaching.SPDistributedCache.Put(String key, Object value)
 at Microsoft.Office.Server.Search.Query.SearchDistributedCache.PutAction(String key, Object value)'
Product: SharePoint Server Search
Category: QueryCache
Level: Unexpected

Message:
DistributedSearchResultsCache::Get() - Failed due to exception = 'Microsoft.Office.Server.DistributedCaching.SPDistributedCacheClusterDownException: Cache cluster is down, restart the cache cluster and Retry ---> Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.).
Additional Information : The client was trying to communicate with the server : net.tcp://cacheserver.example.com:22233
 at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody, RequestBody reqBody)
 at Microsoft.ApplicationServer.Caching.DataCache.InternalGet(String key, DataCacheItemVersion& version, String region, IMonitoringListener listener)
 at Microsoft.ApplicationServer.Caching.DataCache.<>c__DisplayClass49.b__48()
 at Microsoft.Office.Server.DistributedCaching.SPDistributedCache.GetObject(String key)
 --- End of inner exception stack trace ---
 at Microsoft.Office.Server.DistributedCaching.SPDistributedCache.GetObject(String key)
 at Microsoft.Office.Server.Search.Query.SearchResultsDistributedCache.Get(String key)'

Tracing through the logs, we saw that when a user accesses a page, SharePoint attempts to authorize the user to ensure they have access. SharePoint stores the user’s token in the user's browser session and in the DistributedCacheLogonTokenCache container. When SharePoint tried to retrieve the token from distributed cache, the connection would time out or a connection would be unavailable and the comparison would fail. Since it couldn't validate the presented token SharePoint had no choice but to log the user out and redirect them to the sign in page.

Besides exceptions for the DistributedLogonTokenCache, we also saw exceptions for the DistributedSearchCache and noticed Search queries took longer than expected. The search cache exceptions all related to a timeout or connection issue.

These exceptions pointed to a number of configurations that we investigated:

  • RequestTimeout
  • MaxBufferSize
  • MaxConnectionsToServer

We started with focusing our efforts on the DistributedLogonTokenCache. The default values for the Distributed Logon Token Cache are these:

ChannelInitializationTimeout : 60000
ConnectionBufferSize         : 131072
MaxBufferPoolSize            : 268435456
MaxBufferSize                : 8388608
MaxOutputDelay               : 2
ReceiveTimeout               : 60000
ChannelOpenTimeOut           : 20
RequestTimeout               : 20
MaxConnectionsToServer       : 1

These values have no meaning. What are the units for the ChannelOpenTimeOut and RequestTimeout? The ChannelInitializationTimeout is much larger at 60000, so maybe it’s milliseconds. Are RequestTimeout and ChannelOpenTimeout then 20 milliseconds? That seems really small. Maybe it’s 20 seconds? The MSDN page for RequestTimeout doesn’t provide an answer so we initially had to guess. In our development environments we were able to reproduce the issue when we reduced the time outs to a value of “5”. So we tried increasing them to 40 in the test environments. Then 60. Then 120. The issue persisted.

When you drill down into some of the other distributed cache containers like the Search container, you see that the RequestTimeout and ChannelOpenTimeOut default values are 3000.

When we contacted Microsoft they also didn’t know what the units were. They had to check the SharePoint source code to find out. They’re milliseconds. It’s easy to find what you're looking for when you know what to look for and when we look back at the logs we can see the DistributedLogonTokenCache requests and failures are about 20 ms apart and there are 3000 ms between these events for other cache containers like DistributedSearchCache.

Microsoft immediately agreed with our assessment -- increase the timeout values. Now that we knew these were milliseconds we tried increasing the DistributedLogonTokenCache's RequestTimeout and ChannelOpenTimeout from 120 ms to 3000 ms.

We also:

  • increased the MaxConnectionsToServer from 1 to 100, as we were seeing dozens of connection attempts to the cache servers
  • increased the maxBufferPoolSize to 1073741824
  • increased the maxBufferSize to "33554432"
  • increased the RequestTimeout to "3000"
  • increased the ChannelOpenTimeOut to "3000"

You can only configure this with PowerShell. To ensure the changes apply, restart the AppFabric Windows service. We ran the following script on our dedicated cache server:

Get-SPDistributedCacheClientSetting -ContainerType DistributedLogonTokenCache

$DLTC = Get-SPDistributedCacheClientSetting -ContainerType DistributedLogonTokenCache
$DLTC.maxBufferPoolSize = "1073741824"
$DLTC.maxBufferSize = "33554432"
$DLTC.requestTimeout = "3000"
$DLTC.channelOpenTimeOut = "3000"
$DLTC.MaxConnectionsToServer = "100"
Set-SPDistributedCacheClientSetting -ContainerType DistributedLogonTokenCache $DLTC
Restart-Service -Name AppFabricCachingService

As well, to rule out issues with the Security Token Service, we increased some if its values:

  • increased MaxServiceTokenCacheItems to 1500
  • increased MaxLogonTokenCacheItems to 1500

And in PowerShell:

$sts = Get-SPSecurityTokenServiceConfig
$sts.MaxServiceTokenCacheItems = "1500"
$sts.MaxLogonTokenCacheItems = "1500"
$sts.Update()

Followed by resetting IIS with IISRESET on all servers in the farm.

The issue was reduced significantly. Instead of dozens of exceptions a second there were only a few hundred an hour. The frequency of users being logged out deceased so we knew these changes helped. At this point it became harder to reproduce the issue, but it was still present.

We stepped up the logging. On the WFE servers and the cache server we installed Network Monitor, Performance Monitor counters, increased the level of SharePoint logging for distributed cache to VerboseEX with Set-SPLogLevel, added event trace logs, and captured Fiddler logs from the workstations where testers were testing the site. We finally reproduced the issue and with these logs Microsoft was able figure out the issue was because of a known bug in the way garbage collection in AppFabric 1.1 is implemented on .NET Framework v4.5.

The good news is Microsoft had already released a hotfix as part of AppFabric Cumulative Update 3. Since then CU4 has been released and since CUs are cumulative AppFabric CU4 contains the fix as well. We applied CU4 to all SharePoint servers in the farm and modified the DistributedCacheService.exe.config file to include the backgroundGC key. Our config file looks something like this:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
   <configSections>
      ... other configurations ...
   </configSections>
   <appSettings><add key="backgroundGC" value="true"/></appSettings>
   <dataCacheConfig>
   ... other configurations ...
</configuration>

(Update January 30, 2014: HT to Aben Samuel and Gavin Barron for discovering that the appSettings element neets to go between configSections and dataCacheConfig)

Once applied we stopped and started the Distributed Cache service from Central Administration (Application Management -> Services on server). The timeouts and exceptions went away. Users stopped being logged out of the site. Search queries returned in less than a second instead of after 3 or more seconds.

Summary

Issue

Out of the box, AppFabric 1.1 contains a bug with garbage collection. AppFabric 1.1 is a prerequisite for SharePoint 2013 as it is the underlying technology used by the Distributed Cache service.

Affects

SharePoint Server 2013 + March Public Update

Symptoms

Due to the bug, some requests to Distributed Cache time out. In our case, users authenticated to a SharePoint using formed based authentication were unexpectedly logged out of the site because the check for their logon token timed out. As well, requests from the search cache timed out after three seconds increasing the time to load search results.

A review of the ULS logs showed a number of distributed cache exceptions:

Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:The request timed out..
Additional Information : The client was trying to communicate with the server : net.tcp://cacheserver.example.com:22233
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.).
Additional Information : The client was trying to communicate with the server : net.tcp://cacheserver.example.com:22233
Microsoft.Office.Server.DistributedCaching.SPDistributedCacheClusterDownException: Cache cluster is down, restart the cache cluster and Retry ---> Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.).

Resolution

  1. Apply AppFabric Cumulative Update 3, AppFabric Cumulative Update 4, or a later AppFabric CU to all servers in the farm
  2. Add backgroundGC key to DistributedCacheService.exe.config file on all cache servers
  3. Restart AppFabric Windows Service on all cache servers
  4. Restart Distributed Cache SharePoint service on all cache servers
  5. Reset IIS (IISRESET) on all servers in the farm

If the issue persists, you may need to increase timeout and connection values:

  1. Increase distributed cache client settings for affected containers using the Set-SPDistributedCacheClientSetting cmdlet.
  2. Increase security token service values with Get-SPSecurityTokenServiceConfig
  3. Restart AppFabric, and Distributed Cache on cache servers

External References

Configuration References

Distributed Cache containers

DistributedAccessCache
DistributedActivityFeedCache
DistributedActivityFeedLMTCache
DistributedBouncerCache
DistributedDefaultCache
DistributedLogonTokenCache
DistributedSearchCache
DistributedSecurityTrimmingCache
DistributedServerToAppServerAccessTokenCache
DistributedViewStateCache

Get the default cache client Settings for all Distributed Cache containers

Get-SPDistributedCacheClientSetting -ContainerType DistributedAccessCache
Get-SPDistributedCacheClientSetting -ContainerType DistributedActivityFeedCache
Get-SPDistributedCacheClientSetting -ContainerType DistributedActivityFeedLMTCache
Get-SPDistributedCacheClientSetting -ContainerType DistributedBouncerCache
Get-SPDistributedCacheClientSetting -ContainerType DistributedDefaultCache
Get-SPDistributedCacheClientSetting -ContainerType DistributedLogonTokenCache
Get-SPDistributedCacheClientSetting -ContainerType DistributedSearchCache
Get-SPDistributedCacheClientSetting -ContainerType DistributedSecurityTrimmingCache
Get-SPDistributedCacheClientSetting -ContainerType DistributedServerToAppServerAccessTokenCache
Get-SPDistributedCacheClientSetting -ContainerType DistributedViewStateCache

Default client Settings (SharePoint Server 2013 RTM + March 2013 PU)

PS> Get-SPDistributedCacheClientSetting -ContainerType DistributedAccessCache


ChannelInitializationTimeout : 60000
ConnectionBufferSize         : 131072
MaxBufferPoolSize            : 268435456
MaxBufferSize                : 8388608
MaxOutputDelay               : 2
ReceiveTimeout               : 60000
ChannelOpenTimeOut           : 3000
RequestTimeout               : 3000
MaxConnectionsToServer       : 1



PS> Get-SPDistributedCacheClientSetting -ContainerType DistributedActivityFeedCache


ChannelInitializationTimeout : 60000
ConnectionBufferSize         : 131072
MaxBufferPoolSize            : 268435456
MaxBufferSize                : 8388608
MaxOutputDelay               : 2
ReceiveTimeout               : 60000
ChannelOpenTimeOut           : 3000
RequestTimeout               : 3000
MaxConnectionsToServer       : 1



PS> Get-SPDistributedCacheClientSetting -ContainerType DistributedActivityFeedLMTCache


ChannelInitializationTimeout : 60000
ConnectionBufferSize         : 131072
MaxBufferPoolSize            : 268435456
MaxBufferSize                : 8388608
MaxOutputDelay               : 2
ReceiveTimeout               : 60000
ChannelOpenTimeOut           : 3000
RequestTimeout               : 3000
MaxConnectionsToServer       : 1



PS> Get-SPDistributedCacheClientSetting -ContainerType DistributedBouncerCache


ChannelInitializationTimeout : 60000
ConnectionBufferSize         : 131072
MaxBufferPoolSize            : 268435456
MaxBufferSize                : 8388608
MaxOutputDelay               : 2
ReceiveTimeout               : 60000
ChannelOpenTimeOut           : 3000
RequestTimeout               : 3000
MaxConnectionsToServer       : 1



PS> Get-SPDistributedCacheClientSetting -ContainerType DistributedDefaultCache


ChannelInitializationTimeout : 60000
ConnectionBufferSize         : 131072
MaxBufferPoolSize            : 268435456
MaxBufferSize                : 8388608
MaxOutputDelay               : 2
ReceiveTimeout               : 60000
ChannelOpenTimeOut           : 3000
RequestTimeout               : 3000
MaxConnectionsToServer       : 1



PS> Get-SPDistributedCacheClientSetting -ContainerType DistributedLogonTokenCache


ChannelInitializationTimeout : 60000
ConnectionBufferSize         : 131072
MaxBufferPoolSize            : 268435456
MaxBufferSize                : 8388608
MaxOutputDelay               : 2
ReceiveTimeout               : 60000
ChannelOpenTimeOut           : 20
RequestTimeout               : 20
MaxConnectionsToServer       : 1



PS> Get-SPDistributedCacheClientSetting -ContainerType DistributedSearchCache


ChannelInitializationTimeout : 60000
ConnectionBufferSize         : 131072
MaxBufferPoolSize            : 268435456
MaxBufferSize                : 8388608
MaxOutputDelay               : 2
ReceiveTimeout               : 60000
ChannelOpenTimeOut           : 3000
RequestTimeout               : 3000
MaxConnectionsToServer       : 1



PS> Get-SPDistributedCacheClientSetting -ContainerType DistributedSecurityTrimmingCache


ChannelInitializationTimeout : 60000
ConnectionBufferSize         : 131072
MaxBufferPoolSize            : 268435456
MaxBufferSize                : 8388608
MaxOutputDelay               : 2
ReceiveTimeout               : 60000
ChannelOpenTimeOut           : 3000
RequestTimeout               : 3000
MaxConnectionsToServer       : 1






PS> Get-SPDistributedCacheClientSetting -ContainerType DistributedServerToAppServerAccessTokenCache


ChannelInitializationTimeout : 60000
ConnectionBufferSize         : 131072
MaxBufferPoolSize            : 268435456
MaxBufferSize                : 8388608
MaxOutputDelay               : 2
ReceiveTimeout               : 60000
ChannelOpenTimeOut           : 3000
RequestTimeout               : 3000
MaxConnectionsToServer       : 1
PS> Get-SPDistributedCacheClientSetting -ContainerType DistributedViewStateCache


ChannelInitializationTimeout : 60000
ConnectionBufferSize         : 131072
MaxBufferPoolSize            : 268435456
MaxBufferSize                : 8388608
MaxOutputDelay               : 2
ReceiveTimeout               : 60000
ChannelOpenTimeOut           : 20
RequestTimeout               : 20
MaxConnectionsToServer       : 1

Stories say it best.

Are you ready to make your workplace awesome? We're keen to hear what you have in mind.

Interested in learning more about the work we do?

Explore our culture and transformation services.

Our commitment to reconciliation

Learn how Habanero is responding to the Truth and Reconciliation Calls to Action as a settler-owned company operating on Indigenous territories across what is now called Canada.

Read about our commitment