Thursday, March 29, 2018

Enabling changes to Default Services within applications during deployment on a Service Fabric cluster

As it turns out, as of Service Fabric runtime 6.1, allowing applications to change their Default Services during an upgrade is not enabled by default. This behaviour is controlled by the 'EnableDefaultServicesUpgrade' setting in the cluster-level settings section called 'ClusterManager'. You can set it in the Cluster Manifest if you're managing your own cluster on-premises, or in an ARM template if you're deploying to an Azure-based cluster, like so (a PowerShell alternative for an existing cluster is sketched after the template):

{
  "apiVersion": "2016-09-01",
  "type": "Microsoft.ServiceFabric/clusters",
  "name": "[parameters('clusterName')]",
  "location": "[parameters('location')]",
  "dependsOn": [
    "[variables('supportLogStorageAccountName')]"
  ],
  "properties": {
    "certificate": {
      "thumbprint": "[parameters('certificateThumbprint')]",
      "x509StoreName": "[parameters('certificateStoreValue')]"
    },
    "clientCertificateCommonNames": [],
    "clientCertificateThumbprints": [],
    "clusterState": "Default",
    "diagnosticsStorageAccountConfig": {
      "blobEndpoint": "[reference(concat('Microsoft.Storage/storageAccounts/', variables('supportLogStorageAccountName')), '2017-06-01').primaryEndpoints.blob]",
      "protectedAccountKeyName": "StorageAccountKey1",
      "queueEndpoint": "[reference(concat('Microsoft.Storage/storageAccounts/', variables('supportLogStorageAccountName')), '2017-06-01').primaryEndpoints.queue]",
      "storageAccountName": "[variables('supportLogStorageAccountName')]",
      "tableEndpoint": "[reference(concat('Microsoft.Storage/storageAccounts/', variables('supportLogStorageAccountName')), '2017-06-01').primaryEndpoints.table]"
    },
    "fabricSettings": [
      {
        "parameters": [
          {
            "name": "ClusterProtectionLevel",
            "value": "[parameters('clusterProtectionLevel')]"
          }
        ],
        "name": "Security"
      },
      {
        "parameters": [
          {
            "name": "EnableDefaultServicesUpgrade",
            "value": "[parameters('enableDefaultServicesUpgrade')]"
          }
        ],
        "name": "ClusterManager"
      }
    ],
    "managementEndpoint": "[concat('https://',reference(variables('lbIPName')).dnsSettings.fqdn,':',variables('nt0fabricHttpGatewayPort'))]",
    "nodeTypes": [
      {
        "name": "[variables('vmNodeType0Name')]",
        "applicationPorts": {
          "endPort": "[variables('nt0applicationEndPort')]",
          "startPort": "[variables('nt0applicationStartPort')]"
        },
        "clientConnectionEndpointPort": "[variables('nt0fabricTcpGatewayPort')]",
        "durabilityLevel": "Bronze",
        "ephemeralPorts": {
          "endPort": "[variables('nt0ephemeralEndPort')]",
          "startPort": "[variables('nt0ephemeralStartPort')]"
        },
        "httpGatewayEndpointPort": "[variables('nt0fabricHttpGatewayPort')]",
        "isPrimary": true,
        "vmInstanceCount": "[parameters('nt0InstanceCount')]"
      }
    ],
    "provisioningState": "Default",
    "reliabilityLevel": "Silver",
    "upgradeMode": "Automatic",
    "vmImage": "Windows"
  },
  "tags": {
    "resourceType": "Service Fabric",
    "displayName": "IoT Service Fabric Cluster",
    "clusterName": "[parameters('clusterName')]"
  }
}
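
If the cluster is already deployed and you don't want to push a full ARM template update just to flip this flag, the same setting can also be applied from PowerShell. A minimal sketch, assuming the AzureRM.ServiceFabric module is installed and you're already signed in; the resource group and cluster names below are placeholders:

# Flip EnableDefaultServicesUpgrade on an existing Azure-hosted cluster.
# Assumes the AzureRM.ServiceFabric module; the names below are placeholders.
Set-AzureRmServiceFabricSetting -ResourceGroupName "my-resource-group" `
                                -Name "mycluster" `
                                -Section "ClusterManager" `
                                -Parameter "EnableDefaultServicesUpgrade" `
                                -Value "true"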

Thursday, March 22, 2018

Troubleshooting connections to a service running in Service Fabric in Azure

Recently I decided to try deploying a Service Fabric cluster to Azure and investigate what it takes to create applications with Service Fabric. I used the default ARM template to deploy the cluster, with the parameters in the parameter file set appropriately. I've been able to successfully deploy the cluster itself, along with a private sample application that shows as running and healthy within the cluster (all of this with VSTS, but that's for another post). I'm now running into the problem of actually connecting to the service.

When I use Postman to connect to the service, I'm connecting to a URL like:

https://mycluster.westus.cloudapp.azure.com:8870/api/things

However, when I send my request, Postman instantly fails to connect.

Here are the steps I've taken so far to verify the settings:

  • I got the address for my cluster from the "IP Address" resource that the ARM template created, using the "Copy" functionality in the Azure portal.
  • I've verified that the Load Balancer that was set up by the ARM template is correctly configured to use my application ports.
  • I've consulted the documentation for Load Balancer health probes (https://docs.microsoft.com/en-ca/azure/load-balancer/load-balancer-custom-probe-overview#learn-about-the-types-of-probes) to ensure that my machines are using the correct type of probe. In my case, I don't want to use an HTTP probe even though I've got an HTTP service, because an HTTP probe requires a 200 response. Instead, I use the more basic TCP probe, which determines health based on a successful TCP handshake; that should be just fine.
  • I've updated the Diagnostics settings on the Load Balancer to pipe logs out to a storage account. Using the generated output, I've found that, contrary to my expectations, my health probe is in fact failing. What's worse, there's a timing issue: the health probe fails too many times before the Service Fabric host can start up on the VMs, and the Load Balancer then permanently marks the hosts as failed, blocking access to all of my VMs. That would explain both my inability to connect to my services and the speed with which the failure comes back (the traffic doesn't even get past the Load Balancer).
  • I've used Remote Desktop to gain access to the Virtual Machine Scale Set VMs, thanks to the default settings that came with the Service Fabric ARM template. Loading up PowerShell and executing the command "iwr -Method Get -Uri https://localhost:8870/api/Things" yields the error "iwr : The underlying connection was closed: An unexpected error occurred on a send". It would seem that I can't even get a connection to my service working on the local machine, which would explain why the health probes are failing: they're perfectly legitimate (see the TLS diagnosis sketch after this list).
  • Running the command "netstat -an | ? { $_ -like '*8870*' }" on the VMSS VM indicates that the Service Fabric host has in fact launched my process on the expected port of 8870 and that the process is listening on that port. Curiously, I'm also seeing an established connection on that port. This is at least somewhat consistent with the Service Fabric Management Portal showing my service as healthy on all nodes, but inconsistent with the status of the health probe.
  • Figuring that this was a problem I'd already solved in the past, I tried setting the permissions on the certificate stores for the certificates my API application uses (a sketch of that fix follows this list). After some time waiting for the Load Balancer health probes to update, they were able to connect and the services were running properly.
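
For the local connection failure seen over Remote Desktop, the "underlying connection was closed" error usually points at TLS negotiation or certificate trust rather than a missing listener. Here's a rough diagnostic sketch, assuming Windows PowerShell 5.x on the VM and a self-signed certificate on port 8870; the first two lines relax certificate validation and force TLS 1.2 for the current session only, strictly for troubleshooting:

# Diagnostic only: trust any server certificate and force TLS 1.2 for this
# PowerShell session, then retry the request against the local service.
[System.Net.ServicePointManager]::ServerCertificateValidationCallback = { $true }
[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.SecurityProtocolType]::Tls12
iwr -Method Get -Uri https://localhost:8870/api/Things

As for the certificate permission fix that finally got the probes passing, here's a minimal sketch of what "setting the permissions" can look like, assuming the API runs as NETWORK SERVICE and the certificate has a CSP (not CNG) private key; the thumbprint is a placeholder and the commands need an elevated prompt:

# Grant NETWORK SERVICE read access to the private key of the certificate
# the API uses for HTTPS. The thumbprint below is a placeholder.
$thumbprint = "<certificate thumbprint>"
$cert = Get-ChildItem "Cert:\LocalMachine\My\$thumbprint"
$keyName = $cert.PrivateKey.CspKeyContainerInfo.UniqueKeyContainerName
$keyPath = Join-Path $env:ProgramData "Microsoft\Crypto\RSA\MachineKeys\$keyName"
$acl = Get-Acl $keyPath
$rule = New-Object System.Security.AccessControl.FileSystemAccessRule("NETWORK SERVICE", "Read", "Allow")
$acl.AddAccessRule($rule)
Set-Acl -Path $keyPath -AclObject $acl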