I've seen a scary number of Lync 2013 deployments in the last 8 months where Lync is deployed as an Enterprise pool with only two Front End servers. I even saw a couple that have an Enterprise pool with only one Front End server; yes, you read that right, only one Front End server.
So I decided to write another post in my “Simple Understanding” article series, aimed at explaining the Lync 2013 server architecture, how it utilizes Windows Fabric for high availability, and why you should not deploy a two-node Enterprise Edition pool. I'll use the smallest words and simplest explanations I can.
I'm also planning to use this article as a guide to share with customers who have an existing Lync deployment, or who are considering Lync, to help them with their decisions.
In the previous version, Lync 2010, user services were provided by the Back End database, and that's why when the SQL Back End went down, the Lync clients dropped into “Limited Functionality” mode. This has changed with the Lync 2013 server architecture, where user services were moved from the Back End databases to the Lync Front End servers.
The best advantages of this approach, as I see them, are:
- Less dependency on the SQL Back End
- The pool can scale further by adding more Front End servers
- Each user's data is kept on the Front End servers in the pool
Windows Fabric Deep Dive
Microsoft Best Practices: Windows Fabric is a kind of clustering technology. With Lync 2013, Microsoft recommends using 3 nodes in an Enterprise pool; if you cannot use 3 nodes in a pool, then deploy two Standard Edition pools and use pool pairing for high availability.
NOTE: Windows Fabric is configured automatically every time the services start up; as an administrator, there is nothing you need to do regarding Windows Fabric. The configuration can be found inside the Windows Fabric manifest file, located in the following path:
C:\Program Files\Windows Fabric\bin\ClusterManifest.current
In my lab I have one Enterprise pool with 3 Front Ends in it: LYFE01, LYFE02 and LYFE03. Every user enabled for Lync in this pool gets a primary Front End server and two backup servers, one being the primary backup server and the other the secondary backup server.
This can easily be found by using the following PowerShell command:
C:\> Get-CsUserPoolInfo -Identity "user"
As you can see in the output of the command above, the user has:
- Primary front end= LYFE02 (keep it in mind)
- Primary Backup front end = LYFE01
- Secondary Backup front end = LYFE03
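If you want the same information in a script-friendly form, the list of machines can be expanded from the same cmdlet's output. A small sketch (the property name PrimaryPoolMachinesInPreferredOrder is taken from my lab output; verify it on your own deployment with Get-Member):

```powershell
# List the user's pool servers in preferred order:
# the first entry is the primary Front End, followed by
# the primary backup and secondary backup Front Ends
Get-CsUserPoolInfo -Identity "user" |
    Select-Object -ExpandProperty PrimaryPoolMachinesInPreferredOrder
```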
As agreed so far, each user has a primary Front End. This info is also written to the SQL Back End databases, with one tiny difference: on the SQL Back End, each Lync user is represented by a Routing Group. The number of Routing Groups increases as you add more users to the Front End pool. You can find the Routing Groups in the RTC database, inside the RoutingGroupAssignment table (use the following query):
SELECT TOP 1000 [RoutingGroupId]
FROM [rtc].[dbo].[RoutingGroupAssignment]
In my lab I have only one user enabled for Lync, so when checking the Routing Groups in the RTC database I see only one Routing Group (it does not matter which SQL server you're checking, because Lync no longer depends on the Back End for user services).
Each Routing Group has a FrontEndId associated with it; this is the primary Front End server for that Routing Group (user). In my lab it has the value “3”.
You can “decrypt” this value by checking the FrontEnd table, which lists the Front End nodes in the pool (run the following query):
SELECT TOP 1000 [FrontEndId]
FROM [rtc].[dbo].[FrontEnd]
So as you can see, my only Routing Group (user) has FrontEndId “3”, which is LYFE02; that's the same result we got from the Get-CsUserPoolInfo command earlier in this article.
SIDE NOTE: the Routing Group ID is written to the Active Directory user account, under the msRTCSIP-UserRoutingGroupId attribute, in reverse order.
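If you want to compare that attribute with the RoutingGroupId you saw in SQL, here is a hedged sketch. It assumes the attribute stores the GUID as a raw octet string; the [Guid] byte-array constructor applies the same mixed-endian layout, which is where the “reverse order” comes from:

```powershell
# Read the raw routing group bytes from the user's AD account
# (requires the ActiveDirectory module; attribute name as above)
$adUser = Get-ADUser "user" -Properties 'msRTCSIP-UserRoutingGroupId'
$bytes  = [byte[]]$adUser.'msRTCSIP-UserRoutingGroupId'

# Rebuild the GUID; it should match the RoutingGroupId stored
# in the RoutingGroupAssignment table
New-Object System.Guid -ArgumentList (,$bytes)
```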
How it Works
I would summarize how Windows Fabric works in 4 simple steps:
- The first time the Front End service is “Starting”, the Front End server connects to the SQL Back End and collects the user information for all users it is responsible for.
- Once the Front End server finishes collecting user information from the SQL Back End, it replicates that information to both the primary backup and secondary backup servers.
- From this point on, it is the primary Front End's responsibility to write all new user information to the SQL Back End (for example, when a user creates a new conference), as well as to replicate it to both the primary backup and secondary backup servers.
- The Front End service goes into “Started” status.
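The replica placement described above can be observed from the Lync Server Management Shell with Get-CsPoolFabricState, which queries Windows Fabric directly (the pool FQDN below is a placeholder; the exact output and parameter sets may vary between cumulative updates):

```powershell
# Show the Windows Fabric state for the pool: for each routing
# group you should see one primary replica and, in a pool of
# 3 or more nodes, two secondary (backup) replicas
Get-CsPoolFabricState -PoolFqdn "pool01.contoso.com"
```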
Front End Failover
One of the users in my lab (I added more users) has LYFE02 as the primary Front End server, LYFE03 as the primary backup and LYFE01 as the secondary backup Front End. So what happens when the primary Front End server (LYFE02) for that user goes down?
- The Fabric pool manager promotes the primary backup Front End to host the conference directory – Event ID 51037
- The Fabric pool manager makes some changes to the MCU factory, including the chat and phone conference MCUs, etc. – Event ID 51035
- The Fabric pool manager marks the primary Front End as inactive – Event ID 32108
- The primary backup Front End LYFE03 is promoted to be the primary Front End for the user's Routing Group – LS User Services Event ID 32167
- The primary storage services are assigned to another Front End (I noticed it is always the secondary backup Front End, but I'm not sure whether this is always the case) – Lync Storage Services Event ID 32033
- User information is not affected, because the primary Front End was always replicating the user information to the primary backup and secondary backup servers.
If I run Get-CsUserPoolInfo on the same user now, I can see that the primary backup Front End server got promoted to be the primary Front End server, and the secondary backup Front End became the primary backup Front End.
So now that you understand how Lync utilizes Windows Fabric and how it works, here are some notes to keep in mind:
- As mentioned before, Microsoft recommends using 3 Front Ends when deploying an Enterprise pool; if you cannot deploy 3 Front Ends, then use two Standard Edition pools with pool pairing
- Windows Fabric is a kind of failover cluster, and it needs an odd number of votes (hence witness servers) to maintain the pool-level quorum
- According to TechNet, the following table shows the total number of Front Ends that need to be running in a pool to maintain the pool-level quorum
- A Lync 2013 pool still needs the SQL Back End, and if for any reason the SQL Back End is unavailable, the Lync Front Ends go into survivable mode after 30 minutes.
In Case the Minimum Number of Front Ends in a Pool Is Not Met
If the minimum number of Front Ends in a pool is not met, the Front End services start shutting down after 5 minutes. In a nutshell, the following happens:
- LS User Services Event ID 32163: the pool manager disconnects from the Fabric pool manager due to loss of quorum
- LS User Services Event ID 32189: the Fabric pool manager disconnects the users (closes the Routing Group connections)
- LS User Services Event ID 32170: the pool manager tries to connect to the Fabric pool manager and fails; make sure 85% of the Front Ends are up and running
- LS User Services Event ID 32173: the Lync Front End server will start shutting down after 5 minutes
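If you suspect this is happening, you can pull exactly these events out of the “Lync Server” event log. A small sketch (the event IDs are the ones listed above; adjust -MaxEvents to taste):

```powershell
# Pull the quorum-loss related events from the Lync Server log
Get-WinEvent -FilterHashtable @{
    LogName = 'Lync Server'
    Id      = 32163, 32189, 32170, 32173
} -MaxEvents 50 | Format-Table TimeCreated, Id, Message -AutoSize
```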
Two Front End Nodes Pool
- In the case of two servers in an Enterprise pool, Lync uses the SQL Back End server as the witness server to maintain the pool-level quorum; if you have mirrored SQL, Lync will use the primary SQL server in the mirror as the witness server
- To confirm this, check the Windows Fabric “ClusterManifest.current” file and compare it to the one mentioned above in this article (from the 3 Front Ends pool); you will notice an additional section called “Votes”, where Windows Fabric includes the SQL server in the votes to maintain the pool-level quorum.
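A quick way to spot that section is to search the manifest for the vote entries (the path is the one mentioned earlier in this article):

```powershell
# Look for the "Votes" section in the Windows Fabric manifest;
# in a two-node pool the SQL Back End server should appear as a vote
Select-String -Path 'C:\Program Files\Windows Fabric\bin\ClusterManifest.current' `
    -Pattern 'Vote' -Context 0,3
```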
Primary SQL Backend Outage
- If for any reason the primary SQL Back End is unavailable, the following happens (when no SQL mirror is used):
- The Lync clients go into “Limited Functionality” mode, just like what happened with Lync 2010 server.
- If the SQL Back End is not brought back online within 30 minutes, the Lync Front Ends go into survivable mode.
- If for any reason the SQL Back End is unavailable and the minimum number of Front Ends that need to be running in the pool is not met, the Front End services shut down after 5 minutes (check the table above).
The following short video summarizes the cases of a two Front Ends pool.
If your customer is using a virtual infrastructure like Hyper-V or VMware to host the Lync servers, make sure to divide your Front Ends across the physical hosts, especially when dealing with a large number of Front Ends in a pool. Just make sure you don't put all the eggs in one basket: you don't want to lose a physical host which has a user's primary Front End, primary backup Front End and secondary backup Front End all running on it, because that user will go offline until one of those Front Ends is brought back online.
What you would want to have
What you would NOT want to have
NOTE: with the new improvements in Hyper-V and VMware, engineers can utilize the “Live Migration” option and their infrastructure resources to make sure that when a physical node goes down, the Front Ends are migrated automatically to an online physical host.
This covers most of the points a Lync specialist, or a customer considering deploying Lync, needs to know and take into consideration during the planning phase of the project.