 

Dec 21 2013

Trials or tribulation? Inside SharePoint 2013 workflows–Part 2

Hi and welcome to part 2 of a series that aims to showcase the good and the bad of SharePoint 2013 workflows, using a simple use-case. In part 1 I introduced you to the mythical multinational called Megacorp and its document control requirements. Also in part 1, we created a site with the necessary lists and libraries from which we can build the workflow. To recap, we used the Document Center site template, and then modified the built-in document library (Documents) to store the organisation that each document belongs to. We created a custom list called Process Owners that stored who is accountable for controlled documents in each organisation. Below is an XMind map to show you the Information Architecture of the Megacorp document control site.

image_thumb1

Our workflow is going to:

  • Look at a document and determine the organisation that a document belongs to
  • Look in the process owners list for that organisation and then determine the process owner for the organisation by grabbing the information in the “Assigned To” field
  • Create an approval task for the process owner to approve the release of a controlled document.

Now this all sounds straightforward enough in theory, but this is SharePoint we are talking about. Let’s see how reality stacks up against theory. I will give lots of screenshots for readers who have never tried using SharePoint Designer Workflow before…

Creating the workflow

Step 1:

Right, the workflow will fire when a document is modified, so we will start by adding a list workflow and linking it to the Documents library. Using SharePoint Designer 2013, we open the Megacorp site and in the ribbon, we click the List Workflow button. This will cause a dropdown menu to appear below the icon, showing the lists and libraries on the Megacorp site. From this list, we choose the Documents library.

image    image_thumb23

Step 2:

We are presented with the workflow creation screen. Call the workflow “Process Owner Approval” and click OK.

image_thumb9

Note: If, at this point, you get an error, it means that Workflow Manager has not been provisioned properly. Remember that in SharePoint 2013, workflow is a separate product and needs to be installed. This is a change from previous versions of SharePoint, where the workflow engine was always there because it was part of SharePoint itself.

Assuming SharePoint is configured right, SharePoint Designer will have a think for a while and eventually display the following workflow creation screen, ready for us to do cool things!

image_thumb12

The next step is to add a workflow action. The most logical workflow action for us to try is one of the task actions. We will attempt to assign a task to the process owner for a document.

Step 3:

From the workflow ribbon, click the Action icon and choose “Assign a task”. The action will be added to the workflow, ready for you to fill in. For those of you reading this who have never done this before, the process is reminiscent of configuring email rules in Outlook, in that the action is added and you then fill in the blanks.

image_thumb15   image_thumb18

Now let’s configure the finer details of the Assign action above. A task needs to be assigned to a user, and we need to set up the logic for the workflow to work out who that user is. If you recall, the Documents library has a column called Organisation that specifies which Megacorp entity owns each document. A separate list called Process Owners then stores whoever is assigned as the process owner for each organisation (using the Assigned To column). So when the workflow runs on a particular document, we have to take the Organisation specified for that document and use it to search the Process Owners list for the matching Organisation. Once we find it, we grab the value of the Assigned To column and create a task for that user.

Here is a dodgy diagram that explains the logic I just described.

Snapshot
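If you prefer code to dodgy diagrams, below is a minimal sketch of the same lookup logic in Python. It is purely illustrative (the sample data and organisation names are made up) and is not how SharePoint’s workflow engine actually executes anything.

# Minimal sketch of the lookup logic described above. List items are
# represented as plain dictionaries; the sample data is invented.

process_owners = [
    {"Organisation": "Megacorp Mining", "Assigned To": "Jane Doe"},
    {"Organisation": "Megacorp Logistics", "Assigned To": "John Smith"},
]

def find_process_owner(current_item, owners):
    """Return the Assigned To user for the current item's Organisation."""
    for owner in owners:
        if owner["Organisation"] == current_item["Organisation"]:
            return owner["Assigned To"]   # first match wins, as SPD warns later
    return None

current_item = {"Name": "Quality Manual.docx", "Organisation": "Megacorp Mining"}
approver = find_process_owner(current_item, process_owners)
print("Assign the approval task to:", approver)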

So let’s give it a go…

Step 4:

Go back to the “Assign a task” action you just added and click the “this user” hyperlink. The task properties screen will appear as shown below. In the Participant textbox, click the ellipses button to bring up the Select User dialog box.

image_thumb19  image_thumb24

In the Select User dialog box, choose Workflow Lookup for a User and click the Add >> button. This will bring up the ambiguously named Lookup for string dialog box that allows us to choose where to look for our process owner. In the Data source drop down, we choose Process Owners from the list.

Next we have to tell the workflow which column holds the details of the person to perform the task. In the Field from source drop down, choose the Assigned to column as we talked about above.

image_thumb27image

At this point, the workflow knows which list and column holds the information it needs as per the image below:

Snapshot

But currently we do not know the specific user we need. Is it John Smith, Jane Doe or Jack Jones? Fear not though – this is what the Find the list item section of the Lookup for string dialog box is for. This is where we will tell the workflow to find only the person who matches the value of the Organisation column assigned to the document.

Step 5:

First up, we tell the workflow which field in Process Owners will be used for the comparison. This is the Organisation column, so in the Field: drop down, choose Organisation. Next we have to match that Organisation column with the one in the Documents library. Click the Fx button next to the textbox called Value. A new dialog box will appear called Lookup for Extended Field. Leave the Data source dropdown as Current item and in the Field from source dropdown, choose the Organisation column.

image   image

Note 1: in a workflow, “Current item” refers to the item that the workflow has been invoked from. If you recall at the start of this post, we created a list workflow and associated with the Documents library. Current item refers therefore to any document in the Documents library that has had a workflow invoked. All of the metadata associated with the current document is available to the workflow to use.

Note 2: Keen eyed readers may be wondering what the deal is with the column labelled Organisation_0. Don’t worry – I’ll get to that later.

Now that you have selected the Organisation column from the Documents library, we have filled in all of the logic we need to grab the right Assigned To user. When you click OK, the workflow designer will warn you that if you have been silly enough to add multiple process owners for a given organisation, it will use the first one it finds.

image  image

Finally, we are back at the task designer screen. A lot of screenshots just to wire up the logic for finding the right process owner eh? We haven’t even configured the behaviour of the task itself yet!

image

In fact, for now we are not going to wire up the rest of the task. I would like to know if the logic we have just wired up will work. If I can confirm that, I will come back and finish off creating the task with the behaviours I want.  So instead, let’s click Ok to go back to the workflow designer. Now we can see our Assign a task action, ready to go.

image_thumb39

Before we can test and publish the workflow, we need to tell it to end. This is a little counterintuitive to someone who is used to SharePoint 2010 workflows, but it makes sense once you understand the concept of a workflow stage.

SharePoint Workflow 2013 Stages – an interlude…

If you have ever sat around and tried to map out a business process, you have probably experienced the fun (not) of discovering that even a relatively simple business process has a lot of variations to it – some quite ad-hoc or dynamic. Trying to account for all of these variations in workflow design can make for some very complex and scary diagrams with even scarier implementations. As an example, take a look at the diagram below – this is a real process and apparently it is only page 12 of 136!

image

Now in SharePoint 2010, implementing these sorts of workflows was pretty brutal, but in 2013 it has been rethought. A stage is a construct that allows you to group a number of workflow actions together, as well as defining conditions that govern how those actions happen. Each stage in a workflow can transition to any other workflow stage based on conditions. This means that a workflow can effectively loop around different stages and greatly simplify implementing business logic compared to SharePoint 2010.
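To make the idea concrete, here is a rough conceptual sketch of stages as a tiny state machine in Python. This is only an illustration of the concept – it bears no resemblance to how Workflow Manager is actually implemented.

# Conceptual sketch only: each stage runs its actions and then returns the
# name of the next stage (or "End"), which is what lets a workflow loop.

def draft_stage(item):
    print("Running Draft stage actions for", item["Name"])
    return "Approval" if item["ReadyForReview"] else "Draft"

def approval_stage(item):
    print("Assigning an approval task for", item["Name"])
    return "End" if item["Approved"] else "Draft"   # rejected? loop back to Draft

stages = {"Draft": draft_stage, "Approval": approval_stage}

def run_workflow(item, start="Draft"):
    stage = start
    while stage != "End":
        stage = stages[stage](item)   # the stage transition: jump to whatever comes back

run_workflow({"Name": "Quality Manual.docx", "ReadyForReview": True, "Approved": True})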

This is done via the new Transition to Stage workflow action. At the end of each workflow stage, there is a Transition to stage section as shown below:

image

Step 6:

When you click into the Transition to stage section of the workflow, there is only one workflow action available: the Go To A Stage action. Adding this to the workflow will present a dropdown that allows you to transfer the workflow to any stage you have defined, or to end the workflow. Right now we will end the workflow, but we will be revisiting this stage idea later in the series.

image_thumb40  image_thumb41  image_thumb43

All right! We are done. The final step is to save and publish the workflow.

Step 7:

In the ribbon, look for the Publish icon and click it. Assuming everything goes according to plan, you will be back at the workflow overview screen in SharePoint Designer.

image  image

Congratulations – you have published your workflow. “Well that was easy,” I hear you say… But we haven’t tested it yet! In the next post we will test our masterpiece and see what happens. You might already have an inkling that the result may not be what you expect… but let’s save that for the next post.

Thanks for reading

Paul Culmsee


www.hereticsguidebooks.com

 




Sep 04 2013

Troubleshooting PerformancePoint Dashboard Designer Data Connectivity to SharePoint Lists

image

Here is a quick note with regards to PerformancePoint Dashboard Designer connecting to SharePoint lists utilising Per-user identity on a single server. The screenshot above shows what I mean. I have told Dashboard Designer to use a SharePoint list as a data source on a site called “http://myserver” and chosen “Per-user identity” to connect with the credentials of the currently logged in user, rather than a shared user or something configured in the secure store (which is essentially another shared user scenario). Per-user identity makes a lot of sense most of the time for SharePoint lists, because SharePoint already controls the permissions to this data. If you have granted users access to the site in question already, the other options are overkill – especially in a single server scenario where Kerberos doesn’t have to be dealt with.

But when you try this for the first time, a common complaint is to see an error message like the one below.

image

This error is likely caused by the Claims to Windows Token Service (C2WTS) not being started. In case you are not aware of how things work behind the scenes, once a user is authenticated to SharePoint, claims authentication is used for inter/intra farm authentication between SharePoint services. Even when classic authentication is used when a user hits a SharePoint site, SharePoint converts it to a claims identity when it talks to service applications. This is all done for you behind the scenes and is all fine and dandy when SharePoint is talking to other SharePoint components, but if the service application needs to talk to an external data source that does not support claims authentication – say a SQL Database – then the Claims to Windows Token Service is used to convert the claims back to a standard Windows token.

Now based on what I just said, you might expect that PerformancePoint should use claims authentication when connecting to a SharePoint list as a data source – after all, we are not leaving the confines of SharePoint and I just told you that claims authentication is used for inter/intra farm authentication between SharePoint services. But this is not the case. When PerformancePoint connects to a SharePoint list (and that connection is set to Per-user identity as above), it converts the user’s claim back to a Windows token first.

Now the error above is quite common and well documented. The giveaway is when you see an event log ID 1137 and in the details of the log, the message: Monitoring Service was unable to retrieve a Windows identity for “MyDomain\User”.  Verify that the web application authentication provider in SharePoint Central Administration is the default windows Negotiate or Kerberos provider. This is a pretty strong indicator that the C2WTS has not been configured or is not started.
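If you want to rule this out quickly from a script rather than the Services on Server page, the snippet below queries the service state. It assumes the Windows service is registered under its usual short name, c2wts; adjust if yours differs.

# Quick check that the Claims to Windows Token Service is running.
# Assumes the Windows service short name is "c2wts" (adjust if yours differs).
import subprocess

def c2wts_running() -> bool:
    result = subprocess.run(["sc", "query", "c2wts"], capture_output=True, text=True)
    return "RUNNING" in result.stdout

if not c2wts_running():
    print("C2WTS is not running - check Services on Server in Central Administration")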

A lesser known issue is one I came across the other day. In this case, the Claims to Windows Token Service was provisioned and started, and its service account was a local administrator on the box. In Dashboard Designer, here was the message:

image

In this error, you will see event log ID 1101 and, in the details of the log, something like this:

PerformancePoint Services could not connect to the specified data source. Verify that either the current user or Unattended Service Account has read permissions to the data source, depending on your security configuration. Also verify that all required connection information is provided and correct.

System.Net.WebException: The remote name could not be resolved: ‘myserver’
at System.Net.HttpWebRequest.GetRequestStream(TransportContext& context)
at System.Net.HttpWebRequest.GetRequestStream()
at System.Web.Services.Protocols.SoapHttpClientProtocol.Invoke(String methodName, Object[] parameters)
at Microsoft.PerformancePoint.Scorecards.DataSourceProviders.ListService.GetListCollection()
at Microsoft.PerformancePoint.Scorecards.DataSourceProviders.SpListDataSourceProvider.GetCubeNameInfos()

PerformancePoint Services error code 201.

This issue has a misleading message. It suggests that the server name “myserver” could not be resolved, which might have you thinking this is a name resolution issue, but that is not the true root cause. This one is primarily due to a misconfiguration – most likely Active Directory Group Policy messing with server settings. A better hint to the true cause of this issue can be found in the security event log (assuming you have set the server audit policy to audit failures of “privilege use”, which is not enabled by default).

image

The event ID to look for is 4673, and the Task Category is called “Sensitive Privilege Use”. A failure will be logged for the user account running the Claims to Windows Token Service:

Level:         Information
Keywords:      Audit Failure
User:          N/A
Computer:      myserver.mydomain.com
Description:
A privileged service was called.

Subject:
Security ID:        mydomain\sp_c2wts
Account Name:        sp_c2wts
Account Domain:        mydomain

Service Request Information:
Privileges:        SeTcbPrivilege

The key bit of the log above is the “Service Request Information” section: “SeTcbPrivilege” means “Act as part of the operating system”.

Now the root cause of this issue is clear. The Claims to Windows Token Service account is missing the “Act as part of the operating system” right, which is one of its key requirements. You can find this setting by opening Local Security Policy and choosing Local Policies -> User Rights Assignment. Add the Claims to Windows Token Service account to this right. If you cannot do so, then it is governed by an Active Directory policy and you will have to talk to your Active Directory admin to get it done.

image
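If you would rather verify the right from a script than click through the Local Security Policy UI, one rough way (assuming you can run secedit.exe elevated) is to export the local policy and look at who has been granted SeTcbPrivilege:

# Rough check of who holds SeTcbPrivilege ("Act as part of the operating
# system"). Assumes secedit.exe is available and the script runs elevated.
import os
import subprocess
import tempfile

def setcb_grantees() -> str:
    cfg = os.path.join(tempfile.gettempdir(), "secpol.inf")
    subprocess.run(["secedit", "/export", "/cfg", cfg], check=True, capture_output=True)
    with open(cfg, encoding="utf-16") as f:        # secedit exports as UTF-16
        for line in f:
            if line.strip().startswith("SeTcbPrivilege"):
                return line.strip()                # e.g. SeTcbPrivilege = *S-1-5-21-...
    return "SeTcbPrivilege is not granted to any account"

print(setcb_grantees())

The SIDs listed against SeTcbPrivilege should include the C2WTS service account; if the line is missing, or the account is not in it, you have found your problem.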

Hope this helps someone

Paul Culmsee

 

P.S. Below are the full event log details. I’ve only put them here for search engines, so feel free to stop here. :-)

 

Log Name:      Application
Source:        Microsoft-SharePoint Products-PerformancePoint Service
Date:          3/09/2013 10:37:12 PM
Event ID:      1101
Task Category: PerformancePoint Services
Level:         Error
Keywords:
User:          domain\sp_service
Computer:      SPAPP.mydomain.com
Description:
PerformancePoint Services could not connect to the specified data source. Verify that either the current user or Unattended Service Account has read permissions to the data source, depending on your security configuration. Also verify that all required connection information is provided and correct.

System.Net.WebException: The remote name could not be resolved: ‘sharepointaite’
at System.Net.HttpWebRequest.GetRequestStream(TransportContext& context)
at System.Net.HttpWebRequest.GetRequestStream()
at System.Web.Services.Protocols.SoapHttpClientProtocol.Invoke(String methodName, Object[] parameters)
at Microsoft.PerformancePoint.Scorecards.DataSourceProviders.ListService.GetListCollection()
at Microsoft.PerformancePoint.Scorecards.DataSourceProviders.SpListDataSourceProvider.GetCubeNameInfos()

PerformancePoint Services error code 201.

 

 

 

 

Log Name:      Application
Source:        Microsoft-SharePoint Products-PerformancePoint Service
Date:          3/09/2013 10:18:39 PM
Event ID:      1137
Task Category: PerformancePoint Services
Level:         Error
Keywords:
User:          domain\sp_service
Computer:      SPAPP.mydomain.com
Description:
The following data source cannot be used because PerformancePoint Services is not configured correctly.

Data source location: http://sharepoijntsite/Shared Documents/8_.000
Data source name: New Data Source

Monitoring Service was unable to retrieve a Windows identity for “domain\user”.  Verify that the web application authentication provider in SharePoint Central Administration is the default windows Negotiate or Kerberos provider.  If the user does not have a valid active directory account the data source will need to be configured to use the unattended service account for the user to access this data.

Exception details:
System.InvalidOperationException: Could not retrieve a valid Windows identity. —> System.ServiceModel.EndpointNotFoundException: The message could not be dispatched because the service at the endpoint address ‘net.pipe://localhost/s4u/022694f3-9fbd-422b-b4b2-312e25dae2a2′ is unavailable for the protocol of the address.

Server stack trace:
at System.ServiceModel.Channels.ConnectionUpgradeHelper.DecodeFramingFault(ClientFramingDecoder decoder, IConnection connection, Uri via, String contentType, TimeoutHelper& timeoutHelper)
at System.ServiceModel.Channels.ClientFramingDuplexSessionChannel.SendPreamble(IConnection connection, ArraySegment`1 preamble, TimeoutHelper& timeoutHelper)
at System.ServiceModel.Channels.ClientFramingDuplexSessionChannel.DuplexConnectionPoolHelper.AcceptPooledConnection(IConnection connection, TimeoutHelper& timeoutHelper)
at System.ServiceModel.Channels.ConnectionPoolHelper.EstablishConnection(TimeSpan timeout)
at System.ServiceModel.Channels.ClientFramingDuplexSessionChannel.OnOpen(TimeSpan timeout)
at System.ServiceModel.Channels.CommunicationObject.Open(TimeSpan timeout)
at System.ServiceModel.Channels.ServiceChannel.OnOpen(TimeSpan timeout)
at System.ServiceModel.Channels.CommunicationObject.Open(TimeSpan timeout)
at System.ServiceModel.Channels.ServiceChannel.CallOnceManager.CallOnce(TimeSpan timeout, CallOnceManager cascade)
at System.ServiceModel.Channels.ServiceChannel.EnsureOpened(TimeSpan timeout)
at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, Object[] outs, TimeSpan timeout)
at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(IMethodCallMessage methodCall, ProxyOperationRuntime operation)
at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)

Exception rethrown at [0]:
at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
at Microsoft.IdentityModel.WindowsTokenService.S4UClient.IS4UService_dup.UpnLogon(String upn, Int32 pid)
at Microsoft.IdentityModel.WindowsTokenService.S4UClient.CallService(Func`2 contractOperation)
at Microsoft.SharePoint.SPSecurityContext.GetWindowsIdentity()
— End of inner exception stack trace —
at Microsoft.SharePoint.SPSecurityContext.GetWindowsIdentity()
at Microsoft.PerformancePoint.Scorecards.ServerCommon.ConnectionContextHelper.SetContext(ConnectionContext connectionContext, ICredentialProvider credentials)

Log Name:      Security
Source:        Microsoft-Windows-Security-Auditing
Date:          4/09/2013 8:53:04 AM
Event ID:      4673
Task Category: Sensitive Privilege Use
Level:         Information
Keywords:      Audit Failure
User:          N/A
Computer:      SPAPP.mydomain.com
Description:
A privileged service was called.

Subject:
Security ID:        domain\sp_c2wts
Account Name:        sp_c2wts
Account Domain:        DOMAIN
Logon ID:        0xFCE1

Service:
Server:    NT Local Security Authority / Authentication Service
Service Name:    LsaRegisterLogonProcess()

Process:
Process ID:    0x224
Process Name:    C:\Windows\System32\lsass.exe

Service Request Information:
Privileges:        SeTcbPrivilege




Jan 03 2012

The cloud is not the problem–Part 3: When silos strike back…

What can Ikea fails tell us about cloud computing?

My next door neighbour is a builder. When he moved next door, the house was an old piece of crap. Within 6 months, he completely renovated it himself, adding in two bedrooms, an underground garage and all sorts of cool stuff. On the other hand, I bought my house because it was a good location, someone had already renovated it and all we had to do was move in. The reason for this was simple: I had a new baby and more importantly, me and power tools do not mix. I just don’t have the skills, nor the time to do what my neighbour did.

You can probably imagine what would happen if I tried to renovate my house the way my neighbour did. It would turn out like the Ikea fails in the video. Similarly, many SharePoint installs tend to look similar to the video too. Moral of the story? Sometimes it is better to get something pre-packaged than to do it yourself.

In the last post, we examined the “Software as a Service” (SaaS) model of cloud computing in the form of Office 365. Other popular SaaS providers include SlideShare, Salesforce, Basecamp and Tom’s Planner to name a few. Most SaaS applications are browser based and not as feature rich or complex as their on-premise competition. Therefore the SaaS model is a bit like buying a kit home. In SaaS, no user of these services ever touches the underlying cloud infrastructure used to provide the solution, nor do they have a full mandate to tweak and customise to their heart’s content. SaaS is basically predicated on the notion that someone else will do a better set-up job than you, and on the old 80/20 rule about which features of an application are actually used.

Some people may regard the restrictions of SaaS as a good thing – particularly if they have dealt with the consequences of one too many unproductive customisation efforts previously. As many SharePointers know, the more you customise SharePoint, the less resilient it gets. Thus, restricting what sort of customisations can be done might, in many circumstances, be a wise thing to do.

Nevertheless, this actually goes against the genetic traits of pretty much every Australian male walking the planet. The reason is simple: no matter how much our skills are lacking, or how inappropriate our tools or training, we nevertheless always want to do it ourselves. This brings me onto our next cloud provider: Amazon, and their Infrastructure as a Service (IaaS) model of cloud based services. This is the ultimate DIY solution for those of us who find SaaS cramps our style. Let’s take a closer look shall we?

Amazon in a nutshell

Okay, I have to admit that as an infrastructure guy, I am genetically predisposed to liking Amazon’s cloud offerings. Why? Well, as an infrastructure guy, I am like my neighbour who renovated his own house: I’d rather do it all myself because I have acquired the skills to do so. So for any server-hugging infrastructure people out there wondering what they have been missing out on: read on… you might like what you see.

Now first up, it’s easy for new players to get a bit intimidated by Amazon’s bewildering array of offerings with brand names that make no sense to anybody but Amazon… EC2, VPC, S3, ECU, EBS, RDS, AMIs, Availability Zones – sheesh! So I am going to ignore all of their confusing brand names, just hope that you have heard of virtual machines, and assume that you or your tech geeks know all about VMware. The simplest way to describe Amazon is VMware on steroids. Amazon’s service essentially allows you to create virtual machines within Amazon’s “cloud” of large data centres around the world. As I stated earlier, the official cloud terminology that Amazon is traditionally associated with is Infrastructure as a Service (IaaS). This is where, instead of providing ready-made applications like SaaS, a cloud vendor provides lower level IT infrastructure for rent. This consists of stuff like virtualised servers, storage and networking.

Put simply, utilising Amazon, one can deploy virtual servers with your choice of operating system, applications, memory, CPU and disk configuration. Like any good “all you can eat” buffet, one is spoilt for choice. One simply chooses an Amazon Machine Image (AMI) to use as a base for creating a virtual server. You can choose one of Amazon’s pre-built AMIs (base installs of Windows Server or Linux) or you can choose an image from the community contributed list of over 8000 base images. Pretty much any vendor out there who sells a turn-key solution (such as those all-in-one virus scanning/security solutions) has likely created an AMI. Microsoft have also gotten in on the Amazon act and created AMIs for you, optimised by their product teams. Want SQL 2008 the way Microsoft would install it? Choose the Microsoft Optimized Base SQL Server 2008R2 AMI which “contains scripts to install and optimize SQL Server 2008R2 and accompanying services including SQL Server Analysis services, SQL Server Reporting services, and SQL Server Integration services to your environment based on Microsoft best practices.”

The series of screenshots below shows the basic idea. After signing up, use the “Request instance wizard” to create a new virtual server by choosing an AMI first. In the example below, I have shown the default Amazon AMIs under “Quick start” as well as the community AMIs.

image61_thumb1
Amazon’s default AMIs
image3_thumb
Community contributed AMIs

From the list above, I have chosen Microsoft’s “Optimized SQL Server 2008 R2 SP1” from the community AMIs and clicked “Select”. Now I can choose the CPU and memory configuration. Hmm, how does a 16-core server with 60 gig of RAM sound? That ought to do it… :-)

image13_thumb

Now I won’t go through the full process of commissioning virtual servers, but suffice it to say that you can choose which geographic location within Amazon’s cloud this server will reside in, and after 15 minutes or so, your virtual server will be ready to use. It can be assigned a public IP address, firewall restricted and then remotely managed as per any other server. This can all be done programmatically too. You can talk to Amazon via web services to start, monitor, terminate, etc. as many virtual machines as you want, which allows you to scale your infrastructure on the fly and very quickly. There are no long procurement times, and you only pay for the servers that are currently running. If you shut them down, you stop paying.
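To give a feel for the programmatic side, here is a minimal sketch using the modern boto3 Python SDK (which did not exist back then – Amazon’s older web service APIs did the same job). The AMI ID, key pair name and instance type below are placeholders, not recommendations.

# Minimal sketch: launch and later terminate an EC2 instance with boto3.
# The ImageId, KeyName and InstanceType values below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # e.g. a community SQL Server AMI
    InstanceType="m5.xlarge",
    KeyName="my-keypair",
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]
print("Launched", instance_id)

# ...and when you no longer need it, stop paying for it:
ec2.terminate_instances(InstanceIds=[instance_id])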

But what makes it cool…

Now I am sure that some of you might be thinking “big deal… any virtual machine hoster can do that.” I agree – and when I first saw this capability I just saw it as a larger scale VMware/Xen type deployment. But what really made me sit up and take notice was Amazon’s Virtual Private Cloud (VPC) functionality. The super-duper short version of VPC is that it allows you to extend your corporate network into the Amazon cloud. It does this by allowing you to define your own private network and connect to it via site-to-site VPN technology. To see how it works diagrammatically, check out the image below.

image_thumb13

Let’s use an example to understand the basic idea. Let’s say your internal IP address range at your office is 10.10.10.0 to 10.10.10.255 (a /24 for the geeks). With VPC you tell Amazon “I’d like a new IP address range of 10.10.11.0 to 10.10.11.255”. You are then prompted to tell Amazon the public IP address of your internet router. The screenshots below show what happens next:

image_thumb14

image6_thumb

The first screenshot asks you to choose what type of router is at your end. Available choices are Cisco, Juniper, Yamaha, Astaro and generic. The second screenshot shows a sample configuration that is downloaded. Now any Cisco trained person reading this will recognise what is going on here. This is the automatically generated configuration to be added to an organisation’s edge router to create an IPSEC tunnel. In other words, we have extended the corporate network itself into the cloud. Any service can be run on such a network – not just SharePoint. For smaller organisations wanting the benefits of off-site redundancy without the costs of a separate datacentre, this is a very cost effective option indeed.

For the Cisco geeks, the actual configuration is two GRE tunnels that are IPSEC encrypted. BGP is used for route table exchange, so Amazon can learn what routes to tunnel back to your on-premise network. Furthermore Amazon allows you to manage firewall settings at the Amazon end too, so you have an additional layer of defence past your IPSEC router.
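These days the whole arrangement can be scripted as well. The sketch below uses boto3 to create the 10.10.11.0/24 VPC from the example above and a site-to-site VPN connection back to the office router; the router’s public IP address and the region are placeholders.

# Sketch: create a VPC and a site-to-site VPN back to an on-premise router.
# The router public IP (203.0.113.10) and the region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-1")

vpc = ec2.create_vpc(CidrBlock="10.10.11.0/24")["Vpc"]

# The customer gateway represents the on-premise edge router
cgw = ec2.create_customer_gateway(
    Type="ipsec.1", PublicIp="203.0.113.10", BgpAsn=65000
)["CustomerGateway"]

vgw = ec2.create_vpn_gateway(Type="ipsec.1")["VpnGateway"]
ec2.attach_vpn_gateway(VpnGatewayId=vgw["VpnGatewayId"], VpcId=vpc["VpcId"])

vpn = ec2.create_vpn_connection(
    Type="ipsec.1",
    CustomerGatewayId=cgw["CustomerGatewayId"],
    VpnGatewayId=vgw["VpnGatewayId"],
)

# The downloadable router configuration (like the one in the screenshot above)
# comes back as part of the VPN connection details.
print(vpn["VpnConnection"]["CustomerGatewayConfiguration"][:200])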

This is called Virtual Private Cloud (VPC) and, when configured properly, is very powerful. Note the “P” is for private. No server deployed to this subnet is internet accessible unless you choose it to be. This allows you to extend your internal network into the cloud and gain all the provisioning, redundancy and scalability benefits without direct exposure to the internet. As an example, I did a hosted SharePoint extranet where we used SQL log shipping of the extranet content databases back to a DMZ network for redundancy. Try doing that on Office 365!

This sort of functionality shows that Amazon is a mature, highly scalable and flexible IaaS offering. They have been in the business for a long time and it shows because their full suite of offerings is much more expansive than what I can possibly cover here. Accordingly my Amazon experiences will be the subject of a more in-depth blog post or two in future. But for now I will force myself to stop so the non-technical readers don’t get too bored. :-)

So what went wrong?

So after telling you how impressive Amazon’s offering is, what could possibly go wrong? Like the Office365 issue covered in part 2, absolutely nothing with the technology. To understand why, I need to explain Amazon’s pricing model.

Amazon offer a couple of ways to pay for servers (called instances in Amazon speak). An on-demand instance is charged at a per-hour price while the server is running. The more powerful the server is in terms of CPU, memory and disk, the more you pay. To give you an idea, Amazon’s pricing for a Windows box with 8 CPUs and 16GB of RAM, running in Amazon’s “US east” region, will set you back $0.96 per hour (as of 27/12/11). If you do the basic math for that, it equates to around $8,409 per year, or $25,228 over three years. (Yeah, I agree that’s high – even when you consider that you get all the trappings of a highly scalable and fault tolerant datacentre.)

On the other hand, a reserved instance involves making a one-time payment and, in return, receiving a significant discount on the hourly charge for that instance. Essentially, if you are going to run an Amazon server on a 24*7 basis for more than 18 months or so, a reserved instance makes sense as it saves considerable cost over the long term. The same server would only cost you $0.40 per hour if you pay an up-front $2800 for a 3 year term. Total cost: $13,312 over three years – much better.
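For anyone who wants to check the arithmetic, the numbers above fall straight out of a back-of-envelope calculation like this (2011 prices, 24*7 running assumed):

# Back-of-envelope comparison using the 2011 prices quoted above.
HOURS_PER_YEAR = 24 * 365

on_demand_rate = 0.96        # $/hour, on-demand
reserved_rate = 0.40         # $/hour with a reserved instance
reserved_upfront = 2800      # one-time payment for a 3 year term

on_demand_3yr = on_demand_rate * HOURS_PER_YEAR * 3
reserved_3yr = reserved_upfront + reserved_rate * HOURS_PER_YEAR * 3

print(f"On-demand over 3 years: ${int(on_demand_3yr):,}")   # $25,228
print(f"Reserved over 3 years:  ${int(reserved_3yr):,}")    # $13,312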

So with that scene set, consider this scenario: back at the start of 2011, a client of mine consolidated all of their SharePoint cloud services to Amazon from a variety of other hosting providers. They did this for a number of reasons, but it basically boiled down to the fact that they had 1) outgrown the SaaS model and 2) a growing number of clients. As a result, requirements from clients were getting more complicated and beyond what most of the hosting providers could cater for. They also received irregular and inconsistent support from their existing providers, as well as some unexpected downtime that reduced confidence. In short, they needed to consolidate their cloud offering and manage their own servers. They were developing custom SharePoint solutions, needed to support federated claims authentication and required disaster recovery assurance to mitigate the risk of going 100% cloud. Amazon’s VPC offering in particular seemed ideal, because it allowed full control of the servers in a secure way.

Now making this change was not something we undertook lightly. We spent considerable time researching Amazon’s offerings, trying to understand all the acronyms as well as their fine print. (For what it’s worth, I used IBIS as the basis to develop an assessment and the map of my notes can be found here.) As you are about to see though, we did not check well enough.

Back when we initially evaluated the VPC offering, it was only available in a very few Amazon locations (two sites in the USA only) and the service was still in beta. This caused us a bit of a dilemma at the time because of the risk of relying on a beta service. But we were reassured when Amazon confirmed that VPC would eventually be available in all of their datacentres. We also stress tested the service for a few weeks, it remained stable, and we developed and tested a disaster recovery strategy involving SQL log shipping and a standby farm. We also purchased reserved instances from Amazon since these servers were going to be there for the long haul, so we pre-paid to reduce the hourly rates. Quite a complex configuration was provisioned in only two days and we were amazed by how easy it all was.

Things hummed along for 9 months in this fashion and the world was a happy place. We were delighted when Amazon notified us that VPC had come out of beta and was now available in any of Amazon’s datacentres around the world. We only used the US datacentre because it was the only location available at the time. Now we wanted to transfer the services to Singapore. My client contacted Amazon about some finer points on such a move and was informed that they would have to pay for their reserved instances all over again!

What the?

It turns out that reserved instances are not transferable! Essentially, Amazon were telling us that although we paid for a three year reserved instance, and only used it for 9 months, moving the servers to a new region would mean paying all over again for another 3 year reserve. According to Amazon’s documentation, each reserved instance is associated with a specific region, which is fixed for the lifetime of the reserved instance and cannot be changed.

“Okay,” we answer, “we can understand that in circumstances where people move to another cloud provider. But in our case we were not.” We had used around one third of the reserved instance. So surely Amazon should pro-rata the unused amount, and offer that as a credit when we re-purchase reserved instances in Singapore? I mean, we would still be hosting with Amazon, so overall they would not be losing any revenue at all. On the contrary, we would be paying them more, because we would have to sign up for an additional 3 years of reserve when we moved the services.

So we asked Amazon whether that could be done. “Nope,” came back the answer from Amazon’s not-so-friendly billing team, with one of those trite and grossly insulting “Sorry for any inconvenience this causes” ending sentences. After more discussions, it seems that internally within Amazon, each region (or datacentre within each region) is its own profit centre. Therefore, in typical silo fashion, the US datacentre does not want to pay money to the Singapore operation, as that would mean the revenue we paid would no longer be recognised against them.

Result? The customer is screwed, all because the Amazon fiefdoms don’t like sharing the contents of the till. But hey – the regional managers get their bonuses, right? :-(

Conclusion

Like part 2 of this cloud computing series, this is not a technical issue. Amazon’s cloud service in our experience has been reliable and performed well. In this case, we are turned off by the fact that their internal accounting procedures create a situation that is not great for customers who wish to remain loyal to them. In a post about the danger of short termism and ignoring legacy, I gave the example of how dumb it is for organisations to think they are measuring success based on how long it takes to close a helpdesk call. When such a KPI is used, those in support roles have little choice but to artificially close calls when users’ problems have not been solved, because that is how they are deemed to be performing well. The reality is that rather than measuring happy customers, this KPI simply rewards the helpdesk operators who have managed to game the system by getting callers off the phone as soon as they can.

I feel that Amazon are treating this as an internal accounting issue, irrespective of client outcomes. Amazon will lose my client’s business because of this, since they have enough servers hosted that the financial impost of paying all over again is much more than the cost of transferring to a different cloud provider. While VPC and automated provisioning of virtual servers is cool and all, at the end of the day many hosting providers can offer this if you ask them. Although it might not be as slick and fancy as Amazon’s automated configuration, it nonetheless is very doable and the other providers are playing catch-up. Like Apple, Amazon are enjoying the benefits of being first to market with their service, but as competition heats up, others will rapidly bridge the gap.

Thanks for reading

Paul Culmsee




Dec 19 2011

The cloud isn’t the problem–Part 2: When complex technology meets process…

Hi all

Welcome to my second post that delves into the irrational world of cloud computing. In the first post, I described my first foray into the world of web hosting, which started way back in 2000. Back then I was more naive than I am now (although when it comes to predicting the future I am as naive as anybody else.) I concluded Part 1 by asserting that cloud computing is an adaptive change. We are going to explore the effects of this and the challenges it poses in the next few posts.

Adaptive change occurs in a number of areas, including the companies providing a cloud application – especially if on-premise has been the basis of their existence previously. To that end, I’d like to tell you an Office 365 fail story and then see what lessons we can draw from it.

Office 365 and Software as a Service…

For those who have ignored the hype, Office 365 is known in cloud speak as “Software as a Service” (SaaS). Basically one gets SharePoint, Exchange mail, web versions of Office applications and Lync all bundled up together. In Office 365, SharePoint is not run on-premise at all; instead it is all run from Microsoft servers in a subscription arrangement. Once a month you pay Microsoft for the number of users using the service and the world is a happy place.

Office 365, like many SaaS models, keeps much of the complexity of managing SharePoint in the hands of Microsoft. A few years back, Office 365 would have been described in hosting terms as a managed service. Like all managed services, one sacrifices a certain level of control by outsourcing the accompanying complexity. You do not manage or control the underlying cloud infrastructure including network, servers, operating systems, SharePoint farm settings or storage. Furthermore, limited custom code will run on Office 365, because developers do not have back-end access. Only sandbox solutions are available, and even then, there are some additional limitations when compared to on-premise sandbox solutions. You have limited control of SharePoint service applications too, so the best way to think about Office 365 is that your administrative control extends to the site collection level (this is not actually true but suffices for this series.)

One key reason why it’s hard to get feature parity between on-premise and SaaS equivalents is that many SaaS architectures are based around the concept of multitenancy. If you have heard this word bandied about in SharePoint land, it is because multitenancy is supported in SharePoint 2010. But the concept extends to the majority of SaaS providers. To understand it, imagine a swanky office building in the up-market part of town. It has a bunch of companies that rent office space and are therefore tenants. No tenant can afford an entire building, so they all lease office space from the building and enjoy certain economies of scale like a great location, good parking, security and so on. This does have a trade-off though. The tenants have to abide by certain restrictions. An individual tenant can’t just go and paint the building green because it matches their branding. Since the building is a shared resource, it is unlikely the other tenants would approve.

Multitenancy allows the SaaS vendor to support multiple customers with a single platform. The advantage of this model is economies of scale, but the trade-off is the aforementioned loss of customisation flexibility. SaaS vendors will talk this up by telling you that SaaS applications can be updated more frequently than on-premise software, since there is less customisation complexity from each individual customer. While that’s true, it nevertheless means a loss of control or choice in areas like data security, integration with on-premise systems, latency and the flexibility of the application to accommodate change as an organisation grows.

A small example of the restricting effect of multi-tenancy is when you upload a PDF into a SharePoint document library in Office 365. You cannot open the PDF in the browser; instead you are prompted to save it locally. This is because of a well-known issue with a security feature that was added to IE8. In the on-premise SharePoint world, you can modify the behaviour by changing the “Browser File Handling” option in the settings of the affected web application. But with Office 365, you have to live with it (or use a less than elegant workaround) because you do not have any access at a web application level to change the behaviour. Changing it would affect every tenant serviced by that web application.

Minor annoyances aside, if you are a small organisation or you need to mobilise quickly on a project with a geographically dispersed team, Office 365 is a very sweet offering. It is powerful and integrated, and while not fully featured compared to on-premise SharePoint, it is nonetheless impressive. One can move very quickly and be ready to go within one or two business days – that is, if you don’t make a typo…

How a typo caused the world to cave in…

A while back, I was part of a geographically dispersed, multi-organisation team that needed a collaborative portal for around a year. Given that the project team was distributed across different organisations and various parts of Australia, that one of the key stakeholders had suggested SharePoint, and that Office 365 behaved much better than Google Apps behind the overly paranoid proxy servers of the participating organisations, Office 365 seemed ideal and we resolved to use it. I signed up to a Microsoft Office 365 E3 service.

Now when I say sign up, Microsoft uses Telstra in Australia as their Office 365 partner, so I was directed to Telstra’s sign up site. My first hint of trouble to come was when I was asked to re-enter my email address in the signup field. Through some JavaScript wizardry no doubt, I was unable to copy/paste my email address into the confirmation field. They actually made me re-type it. “Hmm,” I thought, “they must really be interested in data validation. At least it reduces the chance that people copy/paste the wrong information into a critical field.” I also noted that there was some nice JavaScript that indicated the strength of the password as it was typed.

But that’s where the fun ended. Soon after entering the necessary details and the obligatory payment information, I was asked to enter a mysterious thing listed only as an Organization Level Attribute and, more specifically, a “Microsoft Online Services Company Identifier.” Checking the question mark icon told me that it is “used to create your Microsoft Online Services account identity.”

image

I wondered if this was the domain name for the site, as there was no descriptive indicator as to the significance of this code. For all I knew, it could be a Microsoft admin code or accounting code. Nevertheless I assumed it was the domain name because I just had a feeling it was. So I entered my online identity and away I went. I got a friendly email message to say things were in motion and I waited my obligatory hour or so for things to provision.

The inbox chimed and I received two emails. One told me I now had a “Telstra t-Suite account” and the other was entitled “Registration confirmation from Microsoft online.” I was thanked for purchasing, and the email stated that “the services are managed via Microsoft Online Portal (MOP), a separate portal to the Telstra T-Suite Management Console.” I had no idea what the Telstra T-Suite Management Console was at this point, but I was invited to log into the Microsoft Online Portal with a supplied username and password.

At this point I swore… I could see by my username that I had made a typo in the Microsoft Online Services Company Identifier. Username: [email protected] – which means I typed in “SampleProjject” instead of “SampleProject” (Aargh!)

The saga begins…

Swearing at my dyslexic typing, I logged a support call to Telstra in the faint hope that I could change this before it was too late… Below is the anonymised mail I sent:

“Hiya

In relation to the order below I accidentally set SampleProjject as the identifier when it should be SampleProject. Can this be rectified before things are commissioned?

Thanks

Paul”

Another hour passed by and my inbox chimed again with a completely unsurprising reply to my query.

“Hi Paul , sorry but company identifier can not be changed because it is used to identify the account in Office 365 database.”

Cursing once again at my own lack of checking, I could not help but shake my head: while I was forced to type in my email address twice (with cut and paste disabled) when I signed up to Office 365, I was given no opportunity to verify the Microsoft Online Services Company Identifier (henceforth known as the MOSI) before giving the final go-ahead. Surely this identifier is just as important as the email address? Therefore, why not ask for it to be entered twice, or visually make it clear what the purpose of this identifier is? Then dumb users like me would get a second chance before opening the hellgate, unleashing forces that can never be contained.

At the end of the day though, the fault was mine so while I think Telstra could do better with their validation and conveying the significance of the MOSI, I caused the issue.

Forces are unleashed…

So I logged into Telstra’s t-suite system and tried to locate my helpdesk call entry. The t-suite site, although not SharePoint, has a bit of a web part feel about it – only as if the height of each web part had been fixed far too small. It turned out that their site doesn’t handle IE9 well: if you look closely, the “my helpdesk cases” and “my service access” sections are collapsed to the point that I couldn’t actually see anything. So I tried Chrome and was able to operate the portal like a normal person would. My teeth gnashed once more…

image  image

Finally, being able to take an action, I open my support request and ask the following:

*** NOTES created by Paul Culmsee
Can I cancel this account and re provision? A typo was made when the MOSI was entered.  The domain name is incorrect for the site.

A few emails went back and forth and I received confirmation that the account was cancelled. I then returned to the Office 365 site and re-applied for an E3 service. This time I triple checked my spelling of the MOSI and clicked “proceed.” I received an email that thanked me for my application and said that I should receive a provisioning notification within an hour or so.

So I wait…

and wait…

and wait…

and wait…

24 hours went by and I received no notification of the E3 service being provisioned. I logged into Telstra’s t-suite and logged a new call, asking when things would be provisioned. Here is what I asked…

Hi there, I have had no notification of this being provisioned from Microsoft. Surely this should be done by now?

In typical level 1 helpdesk fashion, the guy on the other end did not actually read what I wrote. He clearly missed the word “no”.

Hi Paul,

that’s affirmative. Your T-Suite order has been provisioned. As per the instructions in the welcome email you can follow the links to log in to portal.

Contact me on 1800TSUITE Option 2.3 to discuss it further. I’ll keep this case open for a week.

*sigh* – this sort of bad level 1 email support actually does a lot of damage to the reputation of the organisation so I mail back…

But I received no welcome email from Microsoft with the online password details… I have no means to log into the portal

This inane exchange cost me half a day, so I took Telstra’s friendly advice and contacted them “on 1800TSUITE Option 2.3 to discuss it further.” I got a pretty good tech who realised there indeed was a problem. He told me he would look into it and I thanked him for his time. Sometime later he called back and advised me that something was messed up in the provisioning process and that the easiest thing to do was for him to delete my most recent E3 application, and for me to sign up from scratch using a totally different email address and a totally new MOSI. Somehow, either Telstra’s or Microsoft’s systems had associated my email address and MOSI with the original, failed attempt to sign up (the one with the typo), and it was causing the provisioning process to throw an exception somewhere along the line.

On hearing this, I could imagine some giant PowerShell provisioning script with dodgy exception handling getting halfway through and then dying on them. So I was happy to follow the tech’s advice and went through the entire Office 365 sign-up process from the very beginning again (for the third time). This time I used a fresh email address and quadruple checked all of the fields before I provisioned. Eureka! This time things worked as planned. I received all the right confirmation emails and I was able to sign into the Microsoft online portal. From there I created user accounts, provisioned a SharePoint site collection and we were ready to rock and roll. Although the entire saga ended up taking 5 business days from start to finish, I had my portal and the project team got down to business.

Now for what it’s worth, it should be noted that if you are an integrator or are in the business of managing multiple Office 365 services, Telstra requires a different email address to be used for each Office 365 service you purchase. One cannot have an alias like provision@myoffice365supportprovider as the general account used to provision multiple E1-E4 services. Each needs its own t-suite account with a different email address.

Plunged into darkness…

Things hummed along for a couple of months with no hiccups. We received an invoice for the service by email, and then a couple of days later received a mail confirming that the invoice had been automatically paid via credit card. For our purposes, Office 365 was a really terrific solution and the project team really liked it and were getting a lot of value out of it.

I then had to travel overseas and while I was gone, the project team were suddenly unable to log in to the portal. They would receive a “subscription expired” message when attempting to log in. Now this was pretty serious, as the project team was coming up to an important deadline and now no-one could log in. We checked the VISA records and it seemed that the latest invoice had not been deducted from the account, as there was still a balance owing. Since I was overseas, one of my colleagues immediately called up Telstra support (it was now after hours in Perth), was stuck in a queue for an hour and then ended up speaking to two support people. After all of the fuss with the provisioning issues around the MOSI and my typo, it turned out that Telstra support didn’t actually know what a MOSI was in any event. This is what my colleague said:

I was asked for an account number straight away both times, and I explained that I didn’t have one, but I did have the invoice number in question, and that this was a Microsoft Office 365 subscription. They were still unable to locate the account or invoice. I then gave them the MOSI, thinking this would help. Unfortunately, they both had no idea what I was talking about! I explained that users were unable to login to the site with a ‘subscription expired’ error message. I also explained the fact that the VISA had not been processed for this period (although it was fine in the last period).

Both support staff could not access the Office 365 subscription information (even after I gave them our company name). Because I called after hours, t-suite department was not available. The two staff I talked to could not access the account, so could not pull up any of the relevant details. It turns out that after business hours, Telstra redirect t-suite support to the mobile and phones department. The first support person passed me onto technical but the transfer was rerouted to the original call menu – so I went through the whole thing again, press x for this, press x for that, etc. The second time round, I explained it all over again. The tech assured me that it couldn’t be a billing issue and that Telstra generally would not suspend an account because of a few days late payment. If that was the case, prior to suspension, Telstra would send out an email to notify customers of overdue payment. I told him that no such email had been sent. He then said that it would most likely be a technical problem and would have to be dealt with the next day as the T-suite department would not be available til next morning between office hours 9-5pm EST.

I hung up frustrated, no closer to solving the problem after two hours on the phone.

My colleague then got up early and called Telstra at 6am the next day (9am EST is 3 hours ahead of Perth time). She explained the situation again to Telstra t-suite support person all over again. Here again is the words of my colleague:

The first person who took my call (who I will call “girl one”) couldn’t give me an answer and said she’d get someone to call back, and in the meantime she’d check with another department for me. She put me on hold and during this time the call was re-routed back to the original menu when you first call. I thought that instead of waiting for a call that I may not receive soon as this was an emergency, I went through the menu again. This time I got “girl two” and explained the whole thing *again*. I got her to double check that the E3 subscription was set to automatically deduct from the VISA supplied – yes, it was. She noticed that it said 0 licenses available. She told me that she was not sure what that was all about, so would log a call with Microsoft. Girl two advised me that it could take any time between an hour to a few days for a response from Microsoft.

I then got a call from Telstra (girl three) on the cell-phone just after I finished with girl two. This was the person who girl one promised would call back. I told her what I’d gone through with all support staff so far, and that “girl two” was going to log a call with Microsoft. Girl three, like girl two, noticed the 0 licenses available. She wasn’t sure that it was because there were none to begin with or that there were no more available. I stated that the site had been working fine till yesterday. I explained that no one could access the site and that they all got the same message. Same as girl two, girl three advised that she would also log a call with Microsoft. Again, I was told that it could take up to several days before I could get a reply.

Half an hour later, we received an email from Telstra T-Suite support. It stated the following:

Case Number: xxxxxxxx-xxxxxxx

Case Subject: subscription has expired for all users

I checked your account info and invoices. The invoice xxxxx paid for 01 Oct to 01 Nov was for company ID SampleProjject not SampleProject. Please call billing department to change it for you.

With this email, we now knew that the core problem was related to billing in some way. As far as we had been told, Telstra had deleted the original two failed Office 365 subscriptions, but apparently not from their billing systems. The bill had been paid against a phantom E3 service – the deleted one called “SampleProjject”. Accordingly, the live service had expired and users were locked out of the system.

As instructed in the above email, my colleague called up T-Suite billing (there was a phone number on the invoice). In her words:

Once again, the support person asked for the account number, to which I said I didn’t have one. I offered him the invoice number and the MOSI, thinking someone had to know what it was since it was ‘used to identify the account in the Office 365 database.’ He stated he could not ‘pull up an account with the MOSI’ and said something to the effect that he didn’t know what the MOSI ‘was all about’. He asked what company registered the service and I gave him our details. He immediately saw several ‘accounts’ in the billing system related to our company. He noted that the production E3 was a trial subscription, that the trial had now expired, and he surmised that this was most likely the cause of the problem. I queried why this was the case when the subscription was set to automatically deduct payment from the supplied VISA account. He told me that as going from trial to production was a sales thing, I would have to speak to the T-Suite sales department. He also added that we were lucky, because there was a risk that the mistakenly expired E3 service could have been deleted from Office 365.

I called up sales and finally, they were able to correct the problem.

So after a long, stressful and chaotic evening and morning, Armageddon was averted and the portal users were able to log in again.

Reflections…

This whole story started from something seemingly innocuous: a typo that I made in a poorly described text box (the MOSI). From it came a chain of events that could have resulted in a production E3 service being mistakenly deleted. There were multiple failures at various levels (including my bad typing that set this whole thing off). Nevertheless, the first thing that becomes obvious is that this was a high risk issue that had utterly nothing to do with the Office 365 service itself. As I said, the feedback from the project team has been overwhelmingly positive for Office 365. There was no bug and no extended outage caused by any technical factor. Instead, it was the lack of resilience in the systems and processes that surround the Office 365 service. At the end of the day, we almost got nailed because of a billing screw-up, exacerbated by some poor technical support outcomes. Witness the number of people and departments my colleague had to go through to get a straight answer, as well as the two times she was redirected back to the main phone menu system when she was supposed to be transferred.

Now I don’t blame any of the tech support staff (okay, except the first guy who did not read my initial query). I think the tech support staff themselves were equally hamstrung by immature processes and poor integration of systems. What was truly scary about this issue was that it snuck up on us from left field. We thought the issue was resolved once the service was finally provisioned (third time lucky), and we had email receipts of paid invoices. Yet this near fatal flaw was there all along, only manifesting some three months later when the evaluation period expired.

I think there are a number of specific aspects to this story that Microsoft needs to reflect on. I have summarised these below:

  • Why is the registration process to sign up to Office 365 via Telstra such a complete fail of the “Don’t Make Me Think” test?
  • Why is the significance of the MOSI not made more clear when you first enter it (given you have to enter your email address twice)?
  • Why did no-one at all in Telstra support have the faintest idea what a MOSI is?
  • When you entrust your data and service to a cloud provider how confident do you feel when tech support completely misinterprets your query and answers a completely opposite question?
  • How do you think customers with a critical issue feel when the company that sits between you and Microsoft tells you that it will take “between an hour to a few days for a response from Microsoft”? Vote of confidence?
  • How do you think customers with a critical issue feel when the company that sits between you and Microsoft redirects tech support to their cell phone division after hours?
  • How do you think customers with a critical issue feel when the company that sits between you and Microsoft has to pass you around from department to department to solve an issue, and along the way, re-route you back to the main support line?
  • We were advised to delete our E3 accounts and start all over again, so why did Telstra’s systems not delete the service from their billing systems? Presumably the provisioning and billing systems are not integrated, given that from a billing perspective the old E3 service was still there.

Now I hope that I don’t sound bitter and twisted from this experience. In fact, the experience reinforced what most in IT strategy already know: it’s not about the technology. I still like what Office 365 offers and I will continue to use and recommend it under the right circumstances. This experience was simply a sobering reality check that all of the cool features amount to naught when they can be undone by dodgy underlying supporting structures. I hope that Microsoft and Telstra read this and learn from it too. From a customer perspective, having to work through Telstra as a proxy for Microsoft feels like an additional layer of defence on behalf of Microsoft. Is all of this duplication really necessary? Why can’t Australian customers work directly with Microsoft like US customers can?

Moving on…

No cloud provider is immune to these sorts of stories – and for that matter, no on-premise provider is immune either. So for the Amazon fanboys out there who want to take this post as evidence to dump on Microsoft, I have some news for you too. In the next post in this series, I am going to tell you an Amazon EC2 story that, while not resulting in an outage, nevertheless represents some very short sighted dumbass policies. The result is that we are literally forced to hand our business to another cloud provider.

Until then, thanks for reading and happy clouding :-)

Paul Culmsee

www.sevensigma.com.au




Jul 22 2011

Troubleshooting SharePoint (People) Search 101

I’ve been nerding it up lately SharePointwise, doing the geeky things that geeks like to do, like ADFS and Claims Authentication. So in between trying to get my book fully edited and ready for publishing, I might squeeze out the odd technical SharePoint post. Today I had to troubleshoot a broken SharePoint people search for the first time in a while. I thought it was worth explaining the crawl process a little and talking about the most likely ways in which it will break for you, in order of likelihood as I see it. There are articles out there on this topic, but none that I found are particularly comprehensive.

Background stuff

If you consider yourself a legendary IT pro or SharePoint god, feel free to skip this bit. If you prefer a more gentle stroll through SharePoint search land, then read on…

When you provision a Search Service Application as part of a SharePoint installation, you are asked for (among other things) a Windows account to use for the search service. Below is the point in the GUI based configuration where this is done. First up we choose to create a Search Service Application, and then we choose the account to use as the “Search Service Account”. By default this is the account that will do the crawling of content sources.

image    image

Now the search service account is described as follows: “.. the Windows Service account for the SharePoint Server Search Service. This setting affects all Search Service Applications in the farm. You can change this account from the Service Accounts page under Security section in Central Administration.”

Reading this suggests that the Windows service (“SharePoint Server Search 14”) would run under this account. The reality is that the SharePoint Server Search 14 service runs as the farm account. You can see the pre and post provisioning status below. First up, I show below where SharePoint has been installed and the SharePoint Server Search 14 service is disabled, with service credentials of “Local Service”.

image

The next set of pictures show the Search Service Application provisioned according to the following configuration:

  • Search service account: SEVENSIGMA\searchservice
  • Search admin web service account: SEVENSIGMA\searchadminws
  • Search query and site settings account: SEVENSIGMA\searchqueryss

You can see this in the screenshots below.

image

imageimage

Once the service has been successfully provisioned, we can clearly see the “Default content access account” is based on the “Search service account” as described in the configuration above (the first of the three accounts).

image

Finally, as you can see below, once provisioned, it is the SharePoint farm account that is running the search windows service.

image

Once you have provisioned the Search Service Application, the default content access account (in my case SEVENSIGMA\searchservice) is granted “Read” access to all web applications via Web Application User Policies, as shown below. This way, no matter how draconian the permissions of site collections are, the crawler account will have the access it needs to crawl the content, as well as the permissions of that content. You can verify this by looking at any web application in Central Administration (except the Central Administration web application itself) and choosing “User Policy” from the ribbon. You will see in the policy screen that the “Search Crawler” account has “Full Read” access.

image

image
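
If you would rather verify this from the command line than click through Central Administration, the web application’s policy collection is exposed to PowerShell. Here is a minimal sketch, assuming the http://web web application from the screenshots and that you run it from the SharePoint 2010 Management Shell:

$wa = Get-SPWebApplication "http://web"
$wa.Policies | ForEach-Object {
    # Show each policy entry and the roles (e.g. Full Read) bound to it
    "{0} : {1}" -f $_.UserName, (($_.PolicyRoleBindings | ForEach-Object { $_.Name }) -join ", ")
}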

In case you are wondering why the search service needs to crawl the permissions of content, as well as the content itself, it is because it uses these permissions to trim search results for users who do not have access to content. After all, you don’t want to expose sensitive corporate data via search do you?

There is another, more subtle configuration change performed by the Search Service Application. Once the evilness known as the User Profile Service has been provisioned, the Search Service Application will grant the search service account a specific permission to the User Profile Service. SharePoint is smart enough to do this whether the User Profile Service Application is installed before or after the Search Service Application. In other words, if you install the Search Service Application first, and the User Profile Service Application afterwards, the permission will be granted regardless.

The specific permission, by the way, is the “Retrieve People Data for Search Crawlers” permission, as shown below:

image    image

Getting back to the title of this post, this is a critical permission, because without it the search server will not be able to talk to the User Profile Service to enumerate user profile information. The effect of this is empty People Search results.

How people search works (a little more advanced)

Right! Now that the cool kids have joined us (those who skipped the first section), let’s take a closer look at SharePoint People Search in particular. This section delves a little deeper, but fear not, I will try and keep things relatively easy to grasp.

Once the Search Service Application has been provisioned, a default content source, called – originally enough – “Local SharePoint Sites”, is created. Any web applications that exist (and any that are created from here on in) will be listed here. A freshly minted SharePoint server with a single web application shows the following configuration in the Search Service Application:

image

Now hopefully http://web makes sense. Clearly this is the URL of the web application on this server. But you might be wondering what sps3://web is? I will bet that you have never visited a site using sps3:// in a browser either. For good reason too, as it wouldn’t work.

This is a SharePointy thing – or more specifically, a Search Server thing. That funny protocol part of what looks like a URL refers to a connector. A connector allows Search Server to crawl other data sources that don’t necessarily use HTTP, like some native, binary data source. People can develop their own connectors if they feel so inclined, and a classic example is the Lotus Notes connector that Microsoft supplies with SharePoint. If you configure SharePoint to use its Lotus Notes connector (and by the way, it’s really tricky to do), you would see a URL in the form of:

notes://mylotusnotesbox

Make sense? The protocol part of the URL allows the search server to figure out which connector to use to crawl the content. (For what it’s worth, there are many others out of the box. If you want to see all of the connectors, check the list here.)

But the one we are interested in for this discussion is SPS3, which accesses SharePoint user profiles and supports the people search functionality. The way this particular connector works is that when the crawler accesses the SPS3 connector, it in turn calls a special web service at the host specified. The web service is called spscrawl.asmx, and in my example configuration above it would be http://web/_vti_bin/spscrawl.asmx

The basic breakdown of what happens next is this:

  1. Information about the Web site that will be crawled is retrieved (the GetSite method is called, passing in the site from the URL – i.e. the “web” of sps3://web)
  2. Once the site details are validated, the service enumerates all of the user profiles
  3. For each profile, the GetItem method is called, which retrieves all of the user profile properties for a given user. These are added to the index and tagged with a contentclass of “urn:content-class:SPSPeople” (I will get to this in a moment)

Now admittedly this is the simple version of events. If you really want to be scared (or get to sleep tonight) you can read the actual SPS3 protocol specification PDF.

Right! Now let’s finish this discussion with the notion of contentclass. The SharePoint search crawler tags all crawled content according to its class. The name of this “tag” – or in correct terminology, “managed property” – is contentclass. By default, SharePoint has a People Search scope. It essentially limits the search to only returning content tagged with the “People” contentclass.

image

Now to make it easier for you, Dan Attis listed all of the content classes that he knew of back in SharePoint 2007 days. I’ll list a few here, but for the full list visit his site.

  • “STS_Web” – Site
  • “STS_List_850” – Page Library
  • “STS_List_DocumentLibrary” – Document Library
  • “STS_ListItem_DocumentLibrary” – Document Library Items
  • “STS_ListItem_Tasks” – Tasks List Item
  • “STS_ListItem_Contacts” – Contacts List Item
  • “urn:content-class:SPSPeople” – People

(Why some properties follow the uniform resource name format I don’t know *sigh* – geeks, huh?)
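
If you ever want to prove to yourself that people results really are just items tagged with that contentclass, you can issue a keyword query that filters on it. The sketch below uses the search object model from the SharePoint 2010 Management Shell and assumes the http://web site from earlier, and that the contentclass managed property is queryable in your farm (it is by default in my experience):

[void][System.Reflection.Assembly]::LoadWithPartialName("Microsoft.Office.Server.Search")
$site = Get-SPSite "http://web"
$kq = New-Object Microsoft.Office.Server.Search.Query.KeywordQuery($site)
$kq.QueryText = 'contentclass:"urn:content-class:SPSPeople"'   # only return people items
$kq.ResultTypes = [Microsoft.Office.Server.Search.Query.ResultType]::RelevantResults
$kq.RowLimit = 10
$results = $kq.Execute()
$people = $results[[Microsoft.Office.Server.Search.Query.ResultType]::RelevantResults]
"People items in the index: " + $people.TotalRows
$site.Dispose()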

So that was easy Paul! What can go wrong?

So now that we know that although the protocol handler is SPS3, it is still ultimately utilising HTTP as the underlying communication mechanism and calling a web service, we can start to think of all the ways it can break on us. Let’s now take a look at the common problem areas, in order of commonality:

1. The Loopback issue.

This one has been done to death elsewhere and most people know it. What people don’t know so well is that the loopback fix was introduced to prevent an extremely nasty security vulnerability known as a replay attack that came out a few years ago. Essentially, if you make an HTTP connection to your server, from that server, using a name that does not match the name of the server, then the request will be blocked with a 401 error. In terms of SharePoint people search, the sps3:// handler is created when you create your first web application. If that web application happens to use a name that doesn’t match the server name, then the HTTP request to the spscrawl.asmx web service will be blocked due to this issue.

As a result your search crawl will not work and you will see an error in the logs along the lines of:

  • Access is denied: Check that the Default Content Access Account has access to the content or add a crawl rule to crawl the content (0x80041205)
  • The server is unavailable and could not be accessed. The server is probably disconnected from the network.   (0x80040d32)
  • ***** Couldn’t retrieve server http://web.sevensigma.com policy, hr = 80041205 – File:d:\office\source\search\search\gather\protocols\sts3\sts3util.cxx Line:548

There are two ways to fix this: the quick way (DisableLoopbackCheck) and the right way (BackConnectionHostNames). Both involve a registry change and a reboot, but one of them leaves you much more open to exploitation. Spence Harbar wrote about the differences between the two some time ago and I recommend you follow his advice.
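
For reference, both fixes are simple registry edits. The sketch below shows the BackConnectionHostNames approach (with hypothetical host names – substitute the ones your web applications actually use), with the less secure DisableLoopbackCheck variant commented out:

# The "right way": list the host names that are allowed to loop back to this server
$lsa = "HKLM:\SYSTEM\CurrentControlSet\Control\Lsa"
New-ItemProperty -Path "$lsa\MSV1_0" -Name "BackConnectionHostNames" -PropertyType MultiString `
    -Value @("web", "web.sevensigma.com") -Force | Out-Null

# The "quick way" (disables the check entirely - not recommended):
# New-ItemProperty -Path $lsa -Name "DisableLoopbackCheck" -PropertyType DWord -Value 1 -Force | Out-Null

# Reboot (or at minimum restart IIS and the search services) for the change to take effect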

(As a slightly related side note, I hit an issue with the User Profile Service a while back where it gave the error: “Exception occurred while connecting to WCF endpoint: System.ServiceModel.Security.MessageSecurityException: The HTTP request was forbidden with client authentication scheme ‘Anonymous’. —> System.Net.WebException: The remote server returned an error: (403) Forbidden”. In this case I needed to disable the loopback check even though I was using the server name with no alternative aliases or fully qualified domain names. I asked Spence about this one and it seems that the DisableLoopBack registry key addresses more than the SMB replay vulnerability.)

2. SSL

If you add a certificate to your site and mark the site as HTTPS (by using SSL), things change. In the example below, I installed a certificate on the site http://web, removed the binding to HTTP (port 80) and then updated SharePoint’s alternate access mappings to make it an HTTPS world.

Note that the reference to SPS3://WEB is unchanged, and that there is still a reference to HTTP://WEB, as well as an automatically added reference to HTTPS://WEB.

image

So if we were to run a crawl now, what do you think will happen? Certainly we know that HTTP://WEB will fail, but what about SPS3://WEB? Let’s run a full crawl and find out, shall we?

Checking the logs, we have the unsurprising error “the item could not be crawled because the crawler could not contact the repository”. So clearly, SPS3 isn’t smart enough to work out that the web service call to spscrawl.asmx needs to be done over SSL.

image

Fortunately, the solution is fairly easy. There is another connector, identical in function to SPS3 except that it is designed to handle secure sites: SPS3S. We simply change the configuration to use this connector (and while we are there, remove the reference to HTTP://WEB).

image
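
If you prefer to make the same change from PowerShell, the content source start addresses can be edited via the search cmdlets. A rough sketch, assuming a single Search Service Application, the default “Local SharePoint Sites” content source, and that (as I believe) the cmdlet accepts a comma separated list of start addresses:

$ssa = Get-SPEnterpriseSearchServiceApplication
# Replace the start addresses so both content and profiles are crawled over SSL
Set-SPEnterpriseSearchCrawlContentSource -Identity "Local SharePoint Sites" -SearchApplication $ssa `
    -StartAddresses "https://web,sps3s://web"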

Now we retry a full crawl and check for errors… Wohoo – all good!

image

It is also worth noting that there is another SSL related issue with search: the search crawler is a little fussy with certificates. Most people have visited secure web sites that warn about a problem with the certificate, looking something like the image below:

image

Now when you think about it, a search crawler doesn’t have the luxury of asking a user if the certificate is okay. Instead it errs on the side of security and, by default, will not crawl a site if the certificate is invalid in some way. The crawler is also fussier than a regular browser: for example, it doesn’t much like wildcard certificates, even if the certificate is trusted and valid (although all modern browsers accept them).

To alleviate this issue, you can make the following change in the settings of the Search Service Application: Farm Search Administration->Ignore SSL warnings, and tick “Ignore SSL certificate name warnings”.

image  image

image

The implication of this change is that the crawler will now accept any old certificate that encrypts website communications.

3. Permission legacy when changing the crawl account

Let’s assume that we made a configuration mistake when we provisioned the Search Service Application: the search service account (which is the default content access account) is incorrect and we need to change it to something else. Let’s see what happens.

In the search service application management screen, click on the default content access account to change its credentials. In my example I have changed the account from SEVENSIGMA\searchservice to SEVENSIGMA\svcspsearch.

image

Having made this change, let’s review the effect on the Web Application User Policy and the User Profile Service Application permissions. Note that the user policy for the old search crawl account remains, but an entry has been automatically created for the new account. (Now you know why you can end up with multiple accounts with the display name of “Search Crawling Account”.)

image

Now let’s check the User Profile Service Application. Here things are different! The search service account below refers to the *old* account, SEVENSIGMA\searchservice. The required “Retrieve People Data for Search Crawlers” permission has not been granted to the new account!

image

 

image

If you traipsed through the ULS logs, you would see this:

Leaving Monitored Scope (Request (GET:https://web/_vti_bin/spscrawl.asmx)). Execution Time=7.2370958438429 c2a3d1fa-9efd-406a-8e44-6c9613231974
mssdmn.exe (0x23E4) 0x2B70 SharePoint Server Search FilterDaemon e4ye High FLTRDMN: Errorinfo is "HttpStatusCode Unauthorized The request failed with HTTP status 401: Unauthorized." [fltrsink.cxx:553] d:\office\source\search\native\mssdmn\fltrsink.cxx
mssearch.exe (0x02E8) 0x3B30 SharePoint Server Search Gatherer cd11 Warning The start address sps3s://web cannot be crawled. Context: Application ‘Search_Service_Application’, Catalog ‘Portal_Content’ Details: Access is denied. Verify that either the Default Content Access Account has access to this repository, or add a crawl rule to crawl this repository. If the repository being crawled is a SharePoint repository, verify that the account you are using has "Full Read" permissions on the SharePoint Web Application being crawled. (0x80041205)

To correct this issue, manually grant the crawler account the “Retrieve People Data for Search Crawlers” permission in the User Profile Service. As a reminder, this is done via the Administrators icon in the “Manage Service Applications” ribbon.

image
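
If you would rather script the permission grant, the service application security cmdlets can do it. A sketch, assuming the SEVENSIGMA\svcspsearch account from this example and a single User Profile Service Application in the farm:

$upa = Get-SPServiceApplication | Where-Object { $_.TypeName -like "User Profile Service Application*" }
$security = Get-SPServiceApplicationSecurity $upa -Admin
$principal = New-SPClaimsPrincipal "SEVENSIGMA\svcspsearch" -IdentityType WindowsSamAccountName
# Grant the named right to the new crawl account and save the ACL back
Grant-SPObjectSecurity $security $principal "Retrieve People Data for Search Crawlers"
Set-SPServiceApplicationSecurity $upa -Admin $security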

Once this is done, run a full crawl and verify the result in the logs.

4. Missing root site collection

A more uncommon issue that I once encountered is when the web application being crawled is missing a default site collection. In other words, while there are site collections defined using a managed path, such as http://WEB/SITES/SITE, there is no site collection defined at HTTP://WEB.

The crawler does not like this at all, and you get two different errors depending on whether the SPS or HTTP connector is used.

  • SPS:// – Error in PortalCrawl Web Service (0x80042617)
  • HTTP:// – The item could not be accessed on the remote server because its address has an invalid syntax (0x80041208)

image

The fix for this should be fairly obvious: go and create a default site collection for the web application and re-run a crawl.
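
For completeness, creating the missing root site collection is a one liner. The owner alias and template below are placeholders – use whatever suits your environment:

New-SPSite -Url "http://web" -OwnerAlias "SEVENSIGMA\spadmin" -Template "STS#0" -Name "Root site"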

5. Alternative Access Mappings and Contextual Scopes

SharePoint guru (and my squash nemesis) Nick Hadlee recently posted about a problem where there are no search results on contextual search scopes. If you are wondering what they are, Nick explains:

Contextual scopes are a really useful way of performing searches that are restricted to a specific site or list. The “This Site: [Site Name]”, “This List: [List Name]” are the dead giveaways for a contextual scope. What’s better is contextual scopes are auto-magically created and managed by SharePoint for you so you should pretty much just use them in my opinion.

The issue is that when the alternate access mapping (AAM) settings for the default zone on a web application do not match your search content source, the contextual scopes return no results.

I came across this problem a couple of times recently and the fix is really pretty simple – check your alternate access mapping (AAM) settings and make sure the host header that is specified in your default zone is the same url you have used in your search content source. Normally SharePoint kindly creates the entry in the content source whenever you create a web application but if you have changed around any AAM settings and these two things don’t match then your contextual results will be empty. Case Closed!

Thanks Nick
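
A quick way to eyeball both sides of that comparison is from PowerShell: list the default zone AAM entry for the web application and the start addresses in the content source, and make sure they line up. A sketch, again assuming the http://web example:

# The default zone URL for the web application...
Get-SPAlternateURL -WebApplication "http://web" | Where-Object { $_.Zone -eq "Default" }

# ...should match what the content source is actually crawling
$ssa = Get-SPEnterpriseSearchServiceApplication
(Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Identity "Local SharePoint Sites").StartAddresses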

6. Active Directory Policies, Proxies and Stateful Inspection

A particularly insidious way to have problems with search (and not just people search) is via Active Directory policies. For those of you who don’t know what AD policies are, they basically allow geeks to go on a power trip with users’ desktop settings. Consider the image below. Essentially, an administrator can enforce a massive array of settings for all PCs on the network. Such is the extent of what can be controlled that I can’t fit it into a single screenshot. What is listed below is but a small portion of what an anal retentive administrator has at their disposal (mwahahaha!)

image

Common uses of policies include restricting certain desktop settings to maintain consistency, as well as enforcing Internet Explorer security settings, such as the proxy server and security settings like the trusted sites list. One of the common issues encountered with a policy-defined proxy server in particular is that the search service account will have its profile modified to use the proxy server.

The result of this is that now the proxy sits between the search crawler and the content source to be crawled as shown below:

Crawler —–> Proxy Server —–> Content Source

Now even though the crawler does not use Internet Explorer per se, proxy settings aren’t actually specific to Internet Explorer. Internet Explorer, like the search crawler, uses wininet.dll. Wininet is a module that contains Internet-related functions used by Windows applications, and it is this component that utilises the proxy settings.

Sometimes people will troubleshoot this issue by using telnet to connect to the HTTP port (i.e. “telnet web 80”). But telnet does not use the wininet component, so it is actually not a valid method for testing. Telnet will happily report that the web server is listening on port 80 or 443, but that matters little when the crawler tries to access that port via the proxy. Furthermore, even if the crawler and the content source are on the same server, the result is the same: as soon as the crawler attempts to index a content source, the request will be routed to the proxy server. Depending on the vendor and configuration of the proxy server, various things can happen, including:

  • The proxy server cannot handle the NTLM authentication and passes back a 400 error code to the crawler
  • The proxy server has funky stateful inspection which interferes with the allowed HTTP verbs in the communications and interferes with the crawl

For what it’s worth, it is not just proxy settings that can interfere with the HTTP communications between the crawler and the crawled. I have also seen security software get in the way, monitoring HTTP communications and pre-emptively terminating connections or modifying the content of the HTTP request. The effect is that the results passed back to the crawler are not what it expects, and the crawler naturally reports that it could not access the data source, with suitably weird error messages.

Now the very thing that makes this scenario hard to troubleshoot is also the tell-tale sign for it: nothing will be logged in the ULS logs, nor in the IIS logs for the search service. This is because the errors will be logged in the proxy server or by the overly enthusiastic stateful security software.

If you suspect the problem is a proxy server issue, but do not have access to the proxy server to check its logs, the best way to troubleshoot is to temporarily grant the search crawler account enough access to log into the server interactively. Open Internet Explorer and manually check the proxy settings. If you confirm a policy-based proxy setting, you might be able to temporarily disable it and retry a crawl (until the next AD policy refresh reapplies the settings). The ideal way to cure this problem is to ask your friendly Active Directory administrator to either:

  • Remove the proxy altogether from the SharePoint server (watch for certificate revocation slowness as a result)
  • Configure an exclusion in the proxy settings for the AD policy so that the content sources for crawling are not proxied
  • Create a new AD policy specifically for the SharePoint box, leaving the default settings to apply to the rest of the domain member computers.
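
For the interactive check mentioned above, the crawler honours the same per-user wininet settings as Internet Explorer, so you can read them straight from the registry while logged on as the crawl account. A small sketch (these are the standard Internet Explorer connection settings):

# These are the per-user settings that wininet (and therefore the crawler) honours
$reg = "HKCU:\Software\Microsoft\Windows\CurrentVersion\Internet Settings"
Get-ItemProperty -Path $reg | Select-Object ProxyEnable, ProxyServer, ProxyOverride, AutoConfigURL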

If you suspect the issue might be overly zealous stateful inspection, temporarily disable all security-type software on the server and retry a crawl. Just remember: if you have no logs on the server being crawled, chances are it’s not being crawled and you have to look elsewhere.

7. Pre-Windows 2000 Compatible Access group

In an earlier post of mine, I hit an issue where search would yield no results for a regular user, but a domain administrator could happily search SP2010 and get results. Another symptom associated with this particular problem is certain recurring errors in the event log – Event IDs 28005 and 4625.

  • ID 28005 shows the message “An exception occurred while enqueueing a message in the target queue. Error: 15404, State: 19. Could not obtain information about Windows NT group/user ‘DOMAIN\someuser’, error code 0x5”.
  • The 4625 error would complain “An account failed to log on. Unknown user name or bad password status 0xc000006d, sub status 0xc0000064” or else “An Error occured during Logon, Status: 0xc000005e, Sub Status: 0x0”

If you turn up the debug logs inside SharePoint Central Administration for the “Query” and “Query Processor” functions of “SharePoint Server Search”, you will get the error “AuthzInitializeContextFromSid failed with ERROR_ACCESS_DENIED. This error indicates that the account under which this process is executing may not have read access to the tokenGroupsGlobalAndUniversal attribute on the querying user’s Active Directory object. Query results which require non-Claims Windows authorization will not be returned to this querying user.”

image

The fix is to add your search service account to a group called “Pre-Windows 2000 Compatible Access”. The issue is that SharePoint 2010 re-introduced something that was in SP2003: an API call to a function called AuthzInitializeContextFromSid. Apparently it was not used in SP2007, but it is back for SP2010. This particular function requires a certain permission in Active Directory, and the “Pre-Windows 2000 Compatible Access” group happens to have the right required to read the “tokenGroupsGlobalAndUniversal” Active Directory attribute described in the debug error above.
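
If you have the Active Directory module available (Windows Server 2008 R2 or later), adding the account is a one liner; otherwise use Active Directory Users and Computers. The group and account names below follow this example and may differ in your domain:

Import-Module ActiveDirectory
# Add the search service account to the built-in group so AuthzInitializeContextFromSid can do its thing
Add-ADGroupMember -Identity "Pre-Windows 2000 Compatible Access" -Members "searchservice"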

8. Bloody developers!

Finally, Patrick Lamber blogs about another cause of crawler issues. In his case, someone had developed a custom web part that threw an exception when the site was crawled. For whatever reason, this exception did not get thrown when the site was viewed normally via a browser. As a result, no pages or content on the site could be crawled, because all the crawler would see, no matter what it clicked, would be the dreaded “An unexpected error has occurred”. When you think about it, any custom code that takes action based on browser parameters such as locale or language might cause an exception like this – and therefore cause the crawler some grief.

In Patrick’s case there was a second issue as well. His team had developed a custom HTTPModule that did some URL rewriting. As Patrick states: “The indexer seemed to hate our redirections with the Response.Redirect command. I simply removed the automatic redirection on the indexing server. Afterwards, everything worked fine”.

In this case Patrick was using a multi-server farm with a dedicated index server, allowing him to remove the HTTP module for that one server. In smaller deployments you may not have this luxury. So apart from the obvious opportunity to bag programmers :-), this example nicely shows how easy it is for a third party application or custom code to break search. What is important for developers to realise is that client web browsers are not the only thing that loads SharePoint pages.

If you are not aware, the User Agent string identifies the type of client accessing a resource. This is the means by which sites figure out what browser you are using. A quick look at the User Agent used by SharePoint Server 2010 search reveals that the crawler identifies itself as “Mozilla/4.0 (compatible; MSIE 4.01; Windows NT; MS Search 6.0 Robot)”. At the very least, test any custom user interface code such as web parts against this string, and check the crawl logs when the crawler indexes any custom developed stuff.
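
An easy way to test this without waiting for a crawl is to request the page with the crawler’s user agent string and see whether anything blows up. A sketch using the .NET WebClient from PowerShell (the URL is just the example web application from this post):

$wc = New-Object System.Net.WebClient
$wc.UseDefaultCredentials = $true   # authenticate as the logged-on user
$wc.Headers.Add("User-Agent", "Mozilla/4.0 (compatible; MSIE 4.01; Windows NT; MS Search 6.0 Robot)")
$html = $wc.DownloadString("http://web/default.aspx")
if ($html -match "An unexpected error has occurred") { "Page errors under the crawler user agent" }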

Conclusion

Well, that’s pretty much my list of gotchas. No doubt there are lots more, but hopefully this slightly more detailed exploration of them might help some people.

 

Thanks for reading

Paul Culmsee

www.sevensigma.com.au

www.spgovia.com




Mar 31 2011

Consequences of complexity–the evilness of the SharePoint 2010 User Profile Service

Hiya

A few months back I posted a relatively well behaved rant over the ridiculously complex User Profile Service Application of SharePoint 2010. I think this component in particular epitomises SharePoint 2010’s awful combination of “design by committee” clunkiness and sheltered-from-the-real-world Microsoft product manager groupthink, which seems to rate success on the number of half baked features packed in, as opposed to how well those features install, integrate with other products and function properly in real-world scenarios.

Now truth be told, until yesterday I had an unblemished record with the User Profile Service, having successfully provisioned it first time at every site I have visited (and no, I did not resort to running it all as administrator). Of course, we all have Spence to thank for this with his rational guide. Nevertheless, I am strongly starting to think that I should write the irrational guide as a sort of bizarro version of Spence’s articles, combining his rigour with some mega-ranting ;-).

So what happened to blemish my perfect record? Bloody Active Directory policies – that’s what.

In case you didn’t know, SharePoint uses a scaled down, pre-release version of Forefront Identity Manager. Presumably the logic was that this would allow more flexibility by two-way syncing to various directory services, thereby saving the SharePoint team development time and effort, as well as letting them tout yet another cool feature to the masses. Of course, the trade-off that the programmers overlooked is the insane complexity they introduced as a result. I’m sure if you asked Microsoft’s support staff what they think of the UPS, they would tell you it has not worked out overly well. Whether that feedback has made its way back to the hallowed ground of the open-plan cubicles of SharePoint product development I can only guess. But I theorise that if Microsoft made their SharePoint devs accountable for providing front-line tech support for their components, they would suddenly understand why conspiracy theorist support and infrastructure guys act the way they do.

Anyway, I had better suppress my desire for an all out rant and tell you the problem and the fix. The site in question was actually a fairly simple set-up: a two server farm and a single AD forest. About the only thing of significance in an otherwise stock standard setup was that the Active Directory NetBIOS name did not match the Active Directory fully qualified domain name. But this is a well known issue, well covered by TechNet and Spence. A quick bit of PowerShell goodness and some AD permission configuration sorts it out.

Yet when I provisioned the User Profile Service Application and then tried to start the User Profile Synchronisation Service on the server (the big, scary step that strikes fear into practitioners), I hit the sadly common “stuck on starting” error. The ULS logs told me utterly nothing of significance, even when I turned the debug juice to full throttle. The ever helpful Windows event logs showed me Event ID 3:

ForeFront Identity Manager,
Level: Error

.Net SqlClient Data Provider: System.Data.SqlClient.SqlException: HostId is not registered
at Microsoft.ResourceManagement.Data.Exception.DataAccessExceptionManager.ThrowException(SqlException innerException)
at Microsoft.ResourceManagement.Data.DataAccess.RetrieveWorkflowDataForHostActivator(Int16 hostId, Int16 pingIntervalSecs, Int32 activeHostedWorkflowDefinitionsSequenceNumber, Int16 workflowControlMessagesMaxPerMinute, Int16 requestRecoveryMaxPerMinute, Int16 requestCleanupMaxPerMinute, Boolean runRequestRecoveryScan, Boolean& doPolicyApplicationDispatch, ReadOnlyCollection`1& activeHostedWorkflowDefinitions, ReadOnlyCollection`1& workflowControlMessages, List`1& requestsToRedispatch)
at Microsoft.ResourceManagement.Workflow.Hosting.HostActivator.RetrieveWorkflowDataForHostActivator()
at Microsoft.ResourceManagement.Workflow.Hosting.HostActivator.ActivateHosts(Object source, ElapsedEventArgs e)

The most common cause of this message is the NetBIOS issue I mentioned earlier, but in my case that proved to be fruitless. I also took Spence’s advice and installed the February 2011 cumulative update for SharePoint 2010, but to no avail. Every time I provisioned the UPS sync service, I received the above persistent error – many, many, many times. :-(

For what it’s worth, forget googling the above error: it is a bit of a red herring and the results will likely point you to the wrong places.

In my case, the key to the resolution lay in understanding my previously documented issue with the UPS and self-signed certificate creation. This time, I noticed that the certificates were successfully created before the above error happened, yet MIISCLIENT showed that no configuration had been written to Forefront Identity Manager at all. Then I remembered that the SharePoint User Profile Service Application talks to Forefront over HTTPS on port 5725. As soon as I remembered that HTTP was the communication mechanism, I had a strong suspicion of where the problem was – I have seen this sort of crap before…

I wondered if some stupid proxy setting was getting in the way. Back in the halcyon days of SharePoint 2003, I had a similar issue when scheduling SMIGRATE tasks: if the account used to run SMIGRATE was configured to use a proxy server, the task would fail. To find out if that was the case here, a quick run of the GPRESULT tool showed that a proxy configuration script was applied at the domain level for all users. We then logged in as the farm account interactively (given that the farm account needs to be Administrator to provision the UPS anyway, this was not a problem), disabled all proxy configuration via Internet Explorer and tried again.
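
For the record, the resultant set of policy check is nothing fancy (gpresult ships with Windows; the /r switch gives the summary view on Server 2008 and later):

# Run while logged on as the affected account (the farm account in this case)
gpresult /r
# Then check Internet Explorer's connection settings (or the HKCU "Internet Settings" registry key)
# for a proxy server or auto-configuration script pushed down by the listed policies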

Blammo! The service provisions and we are cooking with gas! It was the bloody proxy server. Reconfigure group policy and all is good.

Conclusion

The moral of the story is this: any time Windows components communicate with each other via HTTP, there is always a chance that some AD induced dumbass proxy setting might get in the way. If not that, then stateful security apps that inspect HTTP traffic, or even a corrupted cache (as happened in this case). The ULS logs will never tell you much here, because the problem is not SharePoint per se, but the registry configuration enforced by policy.

So, to ensure that you do not get affected by this, configure all SharePoint servers to be excluded from proxy access, or configure the SharePoint farm account not to use a proxy server at all (watch for certificate revocation related slowness if you do this, though).

Finally, I called this post “consequences of complexity” because the root cause of this sort of problem is very tricky to identify. With so many variables in the mix, how the hell can people figure this sort of stuff out?

Seriously Microsoft, you need to adjust your measures of success to include resiliency of the platform!

 

Thanks for reading

Paul Culmsee

www.sevensigma.com.au




Jan 09 2011

Migrating managed metadata term sets to another farm on another domain

Hiya

[Note: I have written another article that illustrates a process for migrating the entire term store between farms.]

My colleague Chris Tomich recently saved the day with a nice PowerShell script that makes the management and migration of managed metadata between farms easier. If you are an elite developer type, you can scroll down to the end of this post. If you are a normal human, read on for background and context.

The Issue

Recently, I had a client who was decommissioning an old active directory domain and starting afresh. Unfortunately for me, there was a SharePoint 2010 farm on the old domain and they had made considerable effort in leveraging managed metadata.

Under normal circumstances, migrating SharePoint in this scenario is usually done via a content database migration into a newly set up farm. This is because the built in SharePoint 2010 backup/restore, although better than 2007, presumes that the underlying domain that you are restoring to is the same (bad things happen if it is not). Additionally, the complex interdependencies between service applications (especially the user profile service) means that restoring one service application like Managed metadata is fiddly and difficult.

I was constrained severely by time, so I decided to try a quick and dirty approach to see what would happen. I created a new, clean farm and provisioned a managed metadata service application. The client in question had half a dozen term sets in the old SP2010 farm and around 20 managed metadata columns in the site collection that leveraged them. I migrated the content database from the old farm to the new farm via the database attach method and that worked fine.

I then used the SolidQ export tool from codeplex to export those term sets from the managed metadata service of the source farm. Since the new SP2010 farm had a freshly provisioned managed metadata service application, re-importing those terms and term sets into this farm was a trivial process and went without a hitch.

I knew that managed metadata stores each term with a unique identifier (a GUID, I presume, because programmers love GUIDs). Since I was migrating a content database to a different farm, there were two implications:

  1. The terms stored in lists and libraries in that database would reference the GUIDs of the original managed metadata service
  2. The site columns themselves would reference the GUIDs of the term sets of the source managed metadata service

But the GUIDs in the managed metadata service on the new farm will not match the originals. I did not import the terms with their GUIDs intact; I simply imported the terms, which created new GUIDs. Now everything was orphaned. Managed metadata columns would not know which term set they were assigned to. For example, just because a term set called, say, “Clients” exists in the source farm, a term set created in the destination farm with the same name will not automatically cause the managed metadata columns to “re-link” to the Clients term set.

So what does a managed metadata column orphaned from a term set look like? Fortunately, SharePoint handles this gracefully. The leftmost image below shows a managed metadata column happily linked to its term set. The rightmost image shows what it looks like when that term set is deleted while the column remains. As you can see we now have no reference to a term set. However, it is easy to re-point the column to another term set.

image_thumb36 image_thumb37

When you examine list or library items that have managed metadata entered for them prior to the term set being deleted, SharePoint prevents the managed metadata control from being used (the Project Phase column below is greyed out). Importantly, the original data is not lost since SharePoint actually stores the term along with the rest of the item metadata, but the term cannot be updated of course as there is no term set to refer to.

image_thumb8

Once you reconnect a managed metadata column to a term set, the existing terms will still be orphaned; the process of reconnecting does not do any processing of terms to re-link them. In the example below, notice how this time the managed metadata control is enabled again (as it has a valid term set), but the terms within it are marked red, signalling that they are invalid terms that do not exist in this term set. Notice the second screen, though: by clicking on the orphaned term, managed metadata finds it in the associated term set and offers the right suggestion. With a single click, the term is re-linked to the term store and the updated GUID is stored for the item. Too easy.

image_thumb21

image_thumb22

image_thumb23

As the sequence of events above shows, it is fairly easy to re-link a term to the term set by clicking on it and letting managed metadata find that term in the term set. However, it is not something you would want to do for 10,000 items in a list! In addition, the relative ease of relinking a term like the Location column above is elegant only because it is a single term that does not show the hierarchy. This is not the only configuration available for managed metadata though.

If you have set up your column to display the full term hierarchy, as well as allow multiple entries, you will have something like the screen below. In this example, we have three terms selected: “Borough”, “District” and “Neighbourhood”. All three of these terms are at the bottom of a deep hierarchy of terms (e.g. Continent > Political Entity > Country > Province or State > County or Region > City). As a result, things look ugly.

image_thumb25

Unfortunately, in this use-case our single click method will not work. Firstly, we have to click each and every term that has been added to this column, one at a time. Secondly, and more importantly, even if we do click a term, it will not be found in the new term set. The only way to deal with this is to manually remove the term and re-enter it, or use the rather clumsy term picker, as the sequence below shows.

image_thumb33

Note how as each term is selected, it is marked in black. We have to manually delete the orphaned terms in red.

image_thumb34 image_thumb35

Finally, after far too many mouse clicks, we are back in business.

image_thumb31

A better approach?

I showed this behaviour to my Seven Sigma colleague, Chris Tomich. Chris took a look at the issue and within a few hours trotted out a terrific PowerShell script to programmatically reconnect terms to a new term set. You can grab the source code for this script from Chris’s blog. The one thing Chris did tell me was that he wasn’t able to re-link a site or list column to its replacement term set – that bit you have to do yourself. This is because only the GUID of a term set is stored against a column, which means that without access to the original server, the script cannot know which term set to reattach.

Fortunately, there are not that many term sets to deal with, and the real pain is reattaching orphaned terms anyway.
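
For what it’s worth, re-pointing a column at its replacement term set can also be scripted once you know which term set it should be. The sketch below is illustrative only – the site URL, term store, group, term set and column names are all assumptions you would replace with your own:

$site = Get-SPSite "http://newfarm/sites/megacorp"
$session = Get-SPTaxonomySession -Site $site
$termStore = $session.TermStores["Managed Metadata Service"]
$termSet = $termStore.Groups["Megacorp"].TermSets["Clients"]

# Re-point the managed metadata site column at the new term set
$field = [Microsoft.SharePoint.Taxonomy.TaxonomyField]($site.RootWeb.Fields["Clients"])
$field.SspId = $termStore.Id        # the term store GUID
$field.TermSetId = $termSet.Id      # the replacement term set GUID
$field.Update($true)                # push the change to lists using the column
$site.Dispose()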

Chris also added a ‘test’ mode to the script, so that it will flag if a field is currently connected to an invalid term set. This is very handy because you can run the script against a site collection and it will help you narrow down which columns are managed metadata and which need to be updated.

From the SharePoint 2010 Management Shell, the script syntax is as follows (assuming you call your script ReloadTermSets.ps1):

.\ReloadTermSets.ps1 -web “http://localhost” -test -recurse

  • web – the web address to use.
  • test – uses a ‘test’ mode that outputs lines detailing the fields found and the values found for them in the term store, and does not save the items.
  • recurse – causes the script to recurse through all child sites.

Of course, the usual disclaimers apply. Use the script/advice at your own risk. By using/viewing/distributing it you accept responsibility for any loss/corruption of data that may be incurred by said actions.

I hope that people find this useful. I sure did.

Thanks for reading

Paul Culmsee

www.sevensigma.com.au




Aug 17 2010

Why me? Web part errors on new web applications

Oh man, it’s just not my week. After nailing a certificate issue yesterday that killed user profile provisioning, I get an even better one today! I’ve posted it here as a lesson on how not to troubleshoot this issue!

The symptoms:

I created a brand new web application on a SP2010 farm, and irrespective of the site collection I subsequently create, I get the dreaded error "Web Part Error: This page has encountered a critical error. Contact your system administrator if this problem persists"

Below is a screenshot of a web app using the team site template. Not so good huh?

image

The swearing…

So faced with this broken site, I did what any other self respecting SharePoint consultant would do: I silently cursed Microsoft for being at the root of all the world’s evils and took a peek into that very verbose and very cryptic place known as the ULS logs. Pretty soon I found messages like:

0x3348 SharePoint Foundation         General                       8sl3 High     DelegateControl: Exception thrown while building custom control ‘Microsoft.SharePoint.SPControlElement': This page has encountered a critical error. Contact your system administrator if this problem persists. eff89784-003b-43fd-9dde-8377c4191592

0x3348 SharePoint Foundation         Web Parts                     7935 Information http://sp:81/default.aspx – An unexpected error has been encountered in this Web Part.  Error: This page has encountered a critical error. Contact your system administrator if this problem persists.,

Okay, so that is about as helpful as a fart in an elevator, so I turned up the debug juice using that new, pretty debug juicer turner-upper (okay, the diagnostic logging section under Monitoring in Central Administration). I turned on a variety of logs at different times, including:

  • SharePoint Foundation           Configuration                   Verbose
  • SharePoint Foundation           General                         Verbose
  • SharePoint Foundation           Web Parts                       Verbose
  • SharePoint Foundation           Feature Infrastructure          Verbose
  • SharePoint Foundation           Fields                          Verbose
  • SharePoint Foundation           Web Controls                    Verbose
  • SharePoint Server               General                         Verbose
  • SharePoint Server               Setup and Upgrade               Verbose
  • SharePoint Server               Topology                        Verbose

While my logs got very big very quickly, I didn’t get much more detail, apart from one gem that, to me, seemed so innocuous amongst all the detail, yet so kind of… fundamental :-)

0x3348 SharePoint Foundation         Web Parts                     emt7 High     Error: Failure in loading assembly: Microsoft.SharePoint, Version=12.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a eff89784-003b-43fd-9dde-8377c4191592

That rather scary log message was then followed up by this one – which proved to be the clue I needed.

0x3348 SharePoint Foundation         Runtime                       6610 Critical Safe mode did not start successfully. This page has encountered a critical error. Contact your system administrator if this problem persists. eff89784-003b-43fd-9dde-8377c4191592

It was about this time that I also checked the event logs (I told you this post was about how not to troubleshoot) and I saw the same entry as above.

Log Name:      Application
Source:        Microsoft-SharePoint Products-SharePoint Foundation
Event ID:      6610
Description:
Safe mode did not start successfully. This page has encountered a critical error. Contact your system administrator if this problem persists.

I read the error message carefully. This problem was certainly persisting and I was the system administrator, so I contacted myself and resolved to search Google for the “Safe mode did not start successfully” error.

The 46 minute mark epiphany

image

If you watch the TV series “House”, you will know that House always gets an epiphany around the 46 minute mark of the show, just in time to work out what the mystery illness is and save the day. Well, this is the 46 minute mark of this post!

I quickly found that others had hit this issue in the past, and that it related to the process where SharePoint checks web.config and processes all of the controls marked as safe. If you have never seen this, it is the section of your SharePoint web application configuration file that looks like this:

image 

This particular version of the error is commonly seen when people deploy multiple servers in their SharePoint farm and use a different file path for the INETPUB folder. In my case, this was a single server, so although I knew I was on the right track, I knew this wasn’t the issue.

My next thought was to run the site in full trust mode, to see if that would make the site work. This is usually a setting that makes me mad when developers ask for it because it tells me they have been slack. I changed the entry

<trust level="WSS_Minimal" originUrl="" />

to

<trust level="Full" originUrl="" />

But to no avail. Whatever was causing this was not affected by code access security.

I reverted back to WSS_Minimal and decided to remove all of the SafeControl entries from the web.config file, as shown below. I knew the site would bleat about it, but I was interested to see whether the “Safe Mode” error would go away.

image

The result? My broken site was now less broken. It was still bitching, but now it appeared to be bitching more like what I was expecting.

image

After that, it was a matter of adding back the <safecontrol> elements and retrying the site. It didn’t take long to pinpoint the offending entry.

<SafeControl Assembly="Microsoft.SharePoint, Version=12.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a" Namespace="Microsoft.SharePoint.WebPartPages" TypeName="ContentEditorWebPart" Safe="False" />

As soon as I removed this entry, the site came up fine. I even loaded up the content editor web part without this entry and it worked a treat. So how this spurious entry got there is still a mystery.

The final mystery

My colleague and I checked the web.config file in C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\14\CONFIG. This is the one that gets munged with other webconfig.* files when a new web application is provisioned.

Sure enough, its modified date was July 29 (just outside the range of the SharePoint and event logs unfortunately). When we compared against a known good file from another SharePoint site, we immediately saw the offending entry.
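
If you ever need to do the same comparison, PowerShell makes short work of it. A sketch – the path to the known good file is obviously a placeholder for wherever your other farm lives:

$suspect = "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\14\CONFIG\web.config"
$knownGood = "\\otherserver\c$\Program Files\Common Files\Microsoft Shared\Web Server Extensions\14\CONFIG\web.config"

# Show any lines that differ between the two files
Compare-Object (Get-Content $knownGood) (Get-Content $suspect) | Format-List

# Or just hunt for SafeControl entries that reference the old 12.0.0.0 assembly version
Select-String -Path $suspect -Pattern 'Version=12\.0\.0\.0'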

image

The solution store on this SharePoint server is empty and no 3rd party stuff to my knowledge has been installed here. But clearly this file has been modified. So, we did what any self respecting SharePoint consultant would do…

…we blamed the last guy.

 

Thanks for reading

Paul Culmsee

www.sevensigma.com.au




Aug 15 2010

More User Profile Sync issues in SP2010: Certificate Provisioning Fun

Wow, isn’t the SharePoint 2010 User Profile Service just a barrel of laughs. Without a bit of context, when you compare it to SP2007, you can do little but shake your head in bewilderment at how complex it now appears.

I have a theory about all of this. I think that this saga started over a beer in 2008 or so.

I think that Microsoft decided that SharePoint 2010 should be able to write back to Active Directory (something that AD purists dislike, but which sold Bamboo many copies of their sync tool). Presumably the SharePoint team get on really well with the Forefront Identity Manager team, and over a few Friday beers the FIM guys said "Why write your own? Use our fit-for-purpose tool that does exactly this. As an added bonus, you can sync to other directories easily too".

“Damn, that *is* a good idea”, says the SharePoint team and the rest is history. Remember the old saying, the road to hell is paved with good intentions?

Anyways, once you provision the UPS enough times and understand what Forefront Identity Manager does, it all starts to make sense. Of course, for it to make sense you have to mess it up in the first place, and I think pretty much everyone will, because it is essentially impossible to get it right the first time unless you run everything as domain administrator. This is a key factor that I feel did not get enough attention within the product team. I have now visited three sites where I have had to correct issues with the user profile service. Remember, not all of us do SharePoint all day. For the humble system administrator who is also looking after the rest of the network, this implementation is simply too complex. The result? Microsoft support engineers are going to get a lot of calls here, and that is going to cost Microsoft in the end.

One use-case they never tested

I am only going to talk about one of the issues today because Spence has written the definitive article that will get you through if you are doing it from scratch.

I went to a client site where they had attempted to provision user profile synchronisation unsuccessfully. I have no idea what they had tried because unfortunately I wasn't there, but I made a few changes to permissions, AD rights and local security policy as per Spencer's post. I then provisioned user profile sync again and hit this issue: a sequence of 4 event log entries.

Event ID:      234
Description:
ILM Certificate could not be created: Cert step 2 could not be created: C:\Program Files\Microsoft Office Servers\14.0\Tools\MakeCert.exe -pe -sr LocalMachine -ss My -a sha1 -n CN="ForefrontIdentityManager" -sky exchange -pe -in "ForefrontIdentityManager" -ir localmachine -is root

Event ID:      234
Description:
ILM Certificate could not be created: Cert could not be added: C:\Program Files\Microsoft Office Servers\14.0\Tools\CertMgr.exe -add -r LocalMachine -s My -c -n "ForefrontIdentityManager" -r LocalMachine -s TrustedPeople

Event ID:      234
Description:
ILM Certificate could not be created: netsh http error:netsh http add urlacl url=
http://+:5725/ user=Domain\spfarm sddl=D:(A;;GA;;;S-1-5-21-2972807998-902629894-2323022004-1104)

Event ID:      234
Description:
Cannot get the self issued certificate thumbprint:

The theory

Luckily this is one of those rare times where the error message actually makes sense (well, if you have worked with PKI stuff before). Clearly something went wrong in the creation of certificates. Looking at the sequence of events, it seems that as part of provisioning Forefront Identity Manager, a self-signed certificate is created for the computer account, added to the Trusted People certificate store and then used for SSL on a web application or web service listening on port 5725.

By the way, don't go looking in IIS for a web application listening on that port, because it's not there. Just like SQL Reporting Services, FIM appears to host its own HTTP endpoint (hence the netsh urlacl command in the third event above) and doesn't need the overhead of IIS.

The way I ended up troubleshooting this issue was to take a good look at the first error in the sequence and what the command was trying to do. Note the description in the event log is important here. “ILM Certificate could not be created: Cert step 2 could not be created”. So this implies that this command is the second step in a sequence and there was a step 1 that must have worked. Below is the step 2 command that was attempted.

C:\Program Files\Microsoft Office Servers\14.0\Tools\MakeCert.exe -pe -sr LocalMachine -ss My -a sha1 -n CN="ForefrontIdentityManager" -sky exchange -pe -in "ForefrontIdentityManager" -ir localmachine -is root

When you create a certificate, it has to have a trusted issuer. Verisign and Thawte are examples and all browsers consider them trustworthy issuers. But we are not using 3rd party issuers here. Forefront uses a self-signed certificate. In other words, it trusts itself. We can infer that step 1 is the creation of this self-trusted certificate issuer by looking at the parameters of the MakeCert command that step 2 is using.

Now I am not going to annotate every Makecert parameter here, but the English version of the command above says something like:

Make me a shiny new certificate for the local machine account and call it “ForefrontIdentityManager”, issued by a root certificate that can be found in the trusted root store also called ForeFrontIdentityManager.

So this command implies that step 1 was the creation of that root certificate that issues the other certificates. (Product team: you could have given the root issuer certificate a different name to the issued certificate.)
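For the record, my guess (and it is only a guess reconstructed from the step 2 parameters, not something pulled from the logs) is that step 1 looks something along these lines: a self-signed certificate created directly in the local machine's trusted root store so that it can act as the issuer for step 2.

MakeCert.exe -pe -sr LocalMachine -ss root -a sha1 -n CN="ForefrontIdentityManager" -sky exchange -r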

The root cause

Now that we have established a theory of what is going on, the next step is to run the failing MakeCert command from a command prompt and see what we get back. Make sure you do this as the SharePoint farm account so you are comparing apples with apples.

C:\Program Files\Microsoft Office Servers\14.0\Tools>MakeCert.exe -pe -sr LocalMachine -ss My -a sha1 -n CN="ForefrontIdentityManager" -sky exchange -pe -in "ForefrontIdentityManager" -ir localmachine -is root

Error: There are more than one matching certificate in the issuer’s root cert store. Failed

Aha! So what do we have here? The error message states that we have more than one matching certificate in the issuer's root certificate store.

For what it's worth, it is the parameters "-ir localmachine -is root" that specify the certificate store to use. In this case, it is the Trusted Root Certification Authorities store on the local computer.

So let's go and take a look. Run the Microsoft Management Console (MMC) and choose "Add/Remove Snap-in" from the File menu.

image

From the list of snap-ins, choose Certificates and then choose "Computer Account".

image

Now in the list of certificate stores, we need to examine the one that the command refers to: The Trusted Root Certification Authorities store. Well, look at that, the error was telling the truth!

image
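By the way, if you prefer the command line to clicking through MMC, certutil will tell you the same story. Something like the following lists every certificate with that name in the local machine's trusted root store, and two or more hits means you have this problem:

certutil -store Root | findstr /i ForefrontIdentityManager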

Clearly the Forefront Identity Manager provisioning/unprovisioning code does not check for all circumstances. I can only theorise about what my client did to cause this situation because I wasn't privy to what was done on this particular install before I got there, but step 1 of the provisioning process creates an issuing certificate whether one already exists or not. Step 2 then failed because it had no way to determine which of these certificates was the authoritative one.

This was further exacerbated by the fact that each re-attempt creates yet another root certificate, since there is no check for whether one already exists.

The cure is quite easy. Just delete all of the ForefrontIdentityManager certificates from the Trusted Root Certification Authorities and re-provision the user profile sync in SharePoint. Provided that there is no certificate in this store to begin with, step 1 will create it and step 2 will then be able to create the self signed certificate using this issuer just fine.
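You can delete them through the same MMC snap-in shown above, or with certutil if you prefer. The rough approach, deleting by serial number (which you can read from the certutil -store output), looks like this:

certutil -store Root
certutil -delstore Root <serial number of each ForefrontIdentityManager certificate>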

Conclusion (and minor rant)

Many SharePoint pros have commented on the insane complexity of the new design of the user profile sync system. Yes, I understand the increased flexibility offered by the new regime of leveraging a mature product like Forefront, but with all of that flexibility comes risk that has not been well accounted for. SP2010 is twice as tough to learn as SP2007, and you are more likely to make a mistake than not. The more components added, the more points of failure, and the less capable over-burdened support staff are of dealing with it when things go wrong.

SharePoint 2010 is barely out of nappies and I have already been in a remediation role for several sites over the user profile service alone.

I propose that Microsoft add a new program-level KPI to rate how well they are doing with their SharePoint product development. That KPI should be something like the percentage of the time a system administrator can provision a feature without making a mistake or resorting to running it all as admin. The benefit to Microsoft would be tangible in terms of support calls and failed implementations. Such a KPI would force the product team to look at an example like the user profile service and think "How can we make this more resilient?", "How can we reduce the number of manual steps that are not obvious?" and "How can we make the wizard clearer to understand?" (yes, they *will* use the wizard).

Right now it feels like the KPI was how many features could be crammed in, as well as how much integration between components there is. If there is indeed a KPI for that, they definitely nailed it this time around.

Don't get me wrong, it's all good stuff, but if Microsoft is stumping seasoned SharePoint pros with it, then they definitely need to change the focus a bit in terms of what constitutes a good product.

Thanks for reading

Paul Culmsee

www.sevensigma.com.au




Aug 03 2009

Troubleshooting SPSearch and good practices for moving large files

Every so often I get back into geek mode and roll up the sleeves and get my SharePoint hands dirty. Not so long ago I was assisting a small WSS3 based organisation with a major disk subsystem upgrade (iSCSI SAN) and locally attached storage upgrade, driven by content database growth and a need to convert some sites to site collections. By the end of it all, we had a much better set-up with a much better performing disk subsystem, but I was hit by two problems. One was WSS Search being broken and needing a fix and the other was the appallingly slow speed of copying large files around.

So, let’s talk about fixing broken search and then talk about copying large files.

1. When SPSearch breaks…

The SharePoint install in question was a WSS3 site with SP1, and Search Server 2008 had not been installed. The index files on the F:\ partition had to be moved to another partition (G:), so to achieve this I used the command

stsadm -o spsearch -indexlocation G:\DATA\INDEX

Now, 99.9% of the time this will happily do its thing. But today was the 0.1% of the time when it decided to be difficult. Upon executing this command, I received an RPC error. Unfortunately for me, I was out of time, so I decided to completely re-provision search and start all over again.

It didn't matter whether I tried this in Central Administration -> Operations -> Services on Server or via the command line below; neither method worked.

stsadm -o spsearch -action stop

On each attempt, search would get stuck on unprovisioning (stopping), with a sequence of events in the event log (event IDs 2457, 2429 and 2428).

===========================================================================================
Event Source: Windows SharePoint Services 3 Search 
Event Category: Gatherer 
Event ID: 2457 
Description: 

The plug-in manager <SPSearch.Indexer.1> cannot be initialized. 
Context: Application 'Search index file on the search server' 

Details: 
The system cannot find the path specified. (0x80070003) 

===========================================================================================  
Event Source: Windows SharePoint Services 3 Search 
Event Category: Gatherer 
Event ID: 2429 
Description: 

The plug-in in <SPSearch.Indexer.1> cannot be initialized. 
Context: Application '93a1818d-a5ec-40e1-82d2-ffd8081e3b6e', Catalog 'Search' 

Details: 
The specified object cannot be found. Specify the name of an existing object. (0x80040d06) 

===========================================================================================
Event Source: Windows SharePoint Services 3 Search 
Event Category: Gatherer 
Event ID: 2428 
Description: 

The gatherer object cannot be initialized. 
Context: Application 'Search index file on the search server', Catalog 'Search' 

Details: 
The specified object cannot be found. Specify the name of an existing object. (0x80040d06) 

 

So, as you can see, I was stuck. I couldn't clear the existing configuration and the search service would never actually stop. In the end, I started to wonder whether the problem was that my failed attempt to change the index location had not reapplied permissions to the new path. To be sure, I reapplied permissions using the following PSCONFIG command

psconfig -cmd secureresources

This seemed to do the trick. Re-executing the stsadm spsearch stop command no longer returned an error, and the service was finally listed as stopped.

image
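As an aside, if you want to eyeball the permissions on the index folder yourself before and after running that command, cacls will show you what the search and farm accounts actually have. The path below is from my environment, and on Server 2008 or later you would use icacls instead:

cacls G:\DATA\INDEX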

Once stopped, we repartitioned the disks accordingly and now all I had to do was start the damn thing :-)

Through the Central Administration GUI I clicked Start and re-entered all of the configuration settings, including service accounts and the new index location (G:\DATA\INDEX). After a short time, I received the ever helpful “Unknown Error” error message.

image

Rather than change debug settings in web.config, I simply checked the SharePoint logs and the event viewer. Now, I had a new event in the logs.

Event Type: Warning 
Event Source: Windows SharePoint Services 3 Search 
Event Category: Gatherer 
Event ID: 10035 
Description: 
Could not import the registry hive into the registry because it does not exist in the configuration database. 
Context: Application '93a1818d-a5ec-40e1-82d2-ffd8081e3b6e' 

Hmm… It suggests a registry issue, so I checked the registry.

image

 

Although the error message really made no sense to me, checking the registry turned out to be the key to solving this mystery. If you look carefully in the above screenshot, note that the registry key DataDirectory was set to “F:\DATA\INDEX”.

I was surprised at this, because I had re-provisioned the SPSearch to use the new location (G:\DATA\INDEX). I would have thought that changing the default index location would alter the value of this key. A delve into the ULS logs showed events like this.

STSADM.EXE (0x0B38) 0x0830 Search Server Common MS Search Administration 95k1 High WSS Search index move: Changing index location from ‘F:\data\index’ to ‘G:\data\index’.

STSADM.EXE (0x0B38) 0x0830 Search Server Common MS Search Administration 95k2 High WSS Search index move: Index location changed to ‘G:\data\index’.

STSADM.EXE (0x0B38) 0x0830 Search Server Common MS Search Administration 0 High CopyIndexFiles: Source directory ‘F:\data\index\93a1818d-a5ec-40e1-82d2-ffd8081e3b6e’ not found for application ’93a1818d-a5ec-40e1-82d2-ffd8081e3b6e’.

STSADM.EXE (0x0F10) 0x1558 Windows SharePoint Services Topology 8xqz Medium Updating SPPersistedObject SPSearchServiceInstance Parent=SPServer Name=DAPERWS03. Version: 218342 Ensure: 0, HashCode: 54267293, Id: 305c06d7-ec6d-465a-98be-1eafe64d8752, Stack: at Microsoft.SharePoint.Administration.SPPersistedObject.Update() at Microsoft.SharePoint.Administration.SPServiceInstance.Update() at Microsoft.SharePoint.Search.Administration.SPSearchServiceInstance.Update() at Microsoft.Search.Administration.CommandLine.ActionParameter.Run(StringBuilder& output) at Microsoft.SharePoint.Search.Administration.CommandLine.SPSearch.Execute() at Microsoft.Search.Administration.CommandLine.CommandBase.Run(String command, StringDictionary keyValues, String& output) at Microsoft.SharePoint.StsAdmin.SPStsAdmin.RunOperation(SPGlobalAdmin globalAdmin, String st…

mssearch.exe (0x1654) 0x1694 Search Server Common IDXPIPlugin 0 Monitorable CTripoliPiMgr::InitializeNew – _CopyNoiseFiles returned 0x80070003 – File:d:\office\source\search\ytrip\search\tripoliplugin\tripolipimgr.cxx Line:519

mssearch.exe (0x1654) 0x1694 Search Server Common Exceptions 0 Monitorable <Exception><HR>0x80070003</HR><eip>0000000001D4127F</eip><module>d:\office\source\search\ytrip\search\tripoliplugin\tripolipimgr.cxx</module><line>520</line></Exception>

Note the second last entry above (the _CopyNoiseFiles line). It shows that a function called _CopyNoiseFiles returned a code of 0x80070003. This code happens to mean "The system cannot find the path specified", so it appears something is missing.

It then dawned on me. Perhaps the SharePoint installer puts some files into the initially configured index location, and despite the index being moved elsewhere, SharePoint still looks to this original location for some necessary files. To test this, I loaded up a blank Windows 2003 VM and installed SharePoint with SP1 *without* running the configuration wizard. When I looked in the location of the index files, sure enough, there were some files there, as shown below.

image

It turned out that after our disk reconfiguration, the path F:\DATA\INDEX no longer existed. So I recreated the path specified in the registry (F:\DATA\INDEX) and copied in the contents of the CONFIG folder from my fresh VM install. I then started the search service from Central Administration and… bingo! Search finally started successfully… Woohoo!
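For the record, the repair itself amounted to little more than the two commands below. The source path is just an example; it is wherever you parked the files grabbed from the reference install.

mkdir F:\DATA\INDEX
xcopy /E /I C:\temp\fresh-index-files F:\DATA\INDEX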

Now that I had search successfully provisioned, I re-ran the command to change the index location to G:\DATA\INDEX and then started a full crawl.

C:\>stsadm -o spsearch -indexlocation G:\DATA\INDEX

Operation completed successfully.

C:\>stsadm -o spsearch -action fullcrawlstart

Operation completed successfully.

I then checked the event logs and now it seems we are cooking with gas!

Event Type: Information 
Event Source: Windows SharePoint Services 3 Search 
Event Category: Gatherer 
Event ID: 10045 
Description: 

Successfully imported the application configuration snapshot into the registry. 
Context: Application '93a1818d-a5ec-40e1-82d2-ffd8081e3b6e' 

Event Type: Information 
Event Source: Windows SharePoint Services 3 Search 
Event Category: Gatherer 
Event ID: 10044 
Description: 

Successfully stored the application configuration registry snapshot in the database. 
Context: Application 'Serve search queries over help content' 

As a final check, I re-examined the registry and noted that the DataDirectory key had not changed to reflect G:\. Bear that in mind when moving your index to another location. The original path may still be referred to in the configuration.
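If you want to check this on your own farm without trawling through regedit, a recursive registry search will turn up the key quickly enough. I am deliberately not quoting the full key path because I would be guessing at the exact location, so search for it instead:

reg query "HKLM\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0" /s /f DataDirectory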

2. There are RAID cards and there are RAID cards

Since part of the upgrade work was to improve disk performance, we had to detach databases and move them around while we upgraded the disk infrastructure and repartitioned existing disk arrays. The server had an on-board Intel RAID controller with two arrays configured. One was a two-disk RAID 0 SCSI and the other was a three-disk RAID 5 SATA array. The performance of the RAID 5 SATA had always been crap – even crappier than you would expect from onboard RAID 5. When I say crap, I am talking around 35 megabytes per second transfer rate – even on the two-disk SCSI RAID 0 array.

Now, 35MB/sec isn't great, but it's not completely intolerable either. What made this much, much worse, however, was the extreme slowness when copying large files (i.e. >4GB). When trying to copy files like this to the array, the throughput dropped to as low as 2MB/sec.

No matter whether it was Windows Explorer drag and drop or a command line utility like ROBOCOPY, the behaviour was the same. Throughput would be terrific for around 60 seconds, and then it would drop as shown below.

image

My client called the server vendor and was advised to purchase 4 SCSI disks to replace the SATAs. Apparently the poor SATA performance was because SCSI and SATA disks were mixed on the same RAID card and bus. That was a no-no.

It sounded plausible, but of course, after replacing the RAID 5 SATA disks with the SCSI disks, there was no significant improvement in disk throughput at all. Copying large files still showed the pattern illustrated in the screenshot above.

Monitoring disk queue length on the new RAID array showed that disk queues were way outside normal boundaries. Now, I know that some people view disk queue length as a bit of an urban myth as a disk performance metric, but copying the same files to the iSCSI SAN yielded a throughput of around 95MB/sec, and the disk queue value rarely spiked above 2.

Damn! My client wasn't impressed with his well-known server vendor! Not only does the onboard RAID card have average-to-crappy performance to begin with, but RAID 5 with large files makes it positively useless.

Fun with buffers

To me, this smelt like a buffer type of issue. Imagine you are pouring sand into a bucket and the bucket has a hole in it. If you pour sand into the bucket at a faster rate than the hole allows sand to pour out, then eventually you will overflow the bucket. I suspected this sort of thing was happening here. The periods of high throughput were when the bucket was empty and the sand filled it fast. Then the bucket filled up and things slowed to a crawl while all of that sand passed through the metaphorical hole in the bottom. Once the bucket emptied, there was another all-too-brief burst of throughput as it filled quickly again.

I soon found a terrific article from the EPS Windows Server Performance Team that explains what was going on very clearly.

Most file copy utilities like Robocopy or Xcopy call API functions that try to improve performance by keeping data in a buffer. The idea is that files that are changed or accessed frequently can be pulled from the buffer, thereby improving performance and responsiveness. But there is a trade-off. Adding this cache layer introduces an overhead in creating the buffer in the first place. If you are never going to access or copy this file again, adding it to the buffer is actually a bad idea.

Now imagine a huge file. Not only do you have the buffer overhead, but you are also filling the buffer (and forcing it to be flushed) over and over again.

With a large file, you are actually better off avoiding the buffer altogether and doing a raw file copy. Any large file on a slow RAID card will still take time, but it’s a heck of a lot quicker than when combined with the buffering overhead.

Raw file copy methods

In the aforementioned article from the Microsoft EPS Server performance team, they suggest ESEUTIL as an alternative method. I hope they don’t mind me quoting them…

For copying files around the network that are very large, my copy utility of choice is ESEUTIL which is one of the database utilities provided with Exchange.  To get ESEUTIL working on a non-Exchange server, you just need to copy the ESEUTIL.EXE and ESE.DLL from your Exchange server to a folder on your client machine.  It’s that easy.  There are x86 & x64 versions of ESEUTIL, so make sure you use the right version for your operating system.  The syntax for ESEUTIL is very simple: eseutil /y <srcfile> /d <destfile>.  Of course, since we’re using command line syntax – we can use ESEUTIL in batch files or scripts.  ESEUTIL is dependent on the Visual C++ Runtime Library which is available as a redistributable package
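Putting that into practice is as simple as the quote suggests. A hypothetical example, pushing a detached content database file across to the new partition (paths invented for illustration):

eseutil /y D:\SQLDATA\WSS_Content.mdf /d G:\SQLDATA\WSS_Content.mdf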

I found an alternative, however, that proved its worth to me. It is called Teracopy, and we tried the free edition to see what difference it would make to copy times. As it happens, the difference was significant: the time taken to transfer large files around was reduced by a factor of 5. Teracopy also produced a nice running summary of the copy throughput in MB/sec. The product definitely proved its worth, and at a whopping $15 it is not going to break the bank.

So, if you are doing any sort of large file copies and your underlying disk subsystem is not overly fast, then I recommend taking a look at this product. Believe me, it will save you a heap of time.
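One more side note for anyone reading this on a newer operating system: from Windows 7 and Server 2008 R2 onwards, both xcopy and robocopy have a /J switch that requests unbuffered I/O for exactly this reason, so you may not need a third party tool at all. It wasn't an option on the Server 2003-era kit in this story, but as an example (paths are illustrative only):

robocopy D:\SQLBACKUP G:\SQLBACKUP WSS_Content.bak /J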

3. Test your throughput

A final note about this saga. Anyone who deals with SQL Server will have likely read articles about best practice disk configuration in terms of splitting data/logs/backups to different disk arrays to maximise throughput. This client had done this, but since Teracopy gave us nice throughput stats, we took the opportunity to test the read/write performance of all disk subsystems and it turned out that putting ALL data onto the SAN array had significantly better performance than using any of the onboard arrays.

This meant that the by-the-book configuration was hamstrung by a poorly performing onboard RAID controller, and even though the idea of using separate disks/spindles seemed logical, the cold hard facts of direct throughput testing proved otherwise.

After reconfiguring the environment to take advantage of this, the difference in WSS response time when performing bulk uploads was immediately noticeable.
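Incidentally, if you don't have Teracopy handy, even robocopy gives you a rough throughput figure: the job summary it prints at the end includes a Speed line in bytes per second. Copying a single large test file to each array in turn (example paths below) is a crude but effective way to compare them:

robocopy C:\temp G:\temp bigtestfile.dat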

The moral of the story?

If you are a smaller organisation and can’t afford the high end server gear like Compaq/IBM, then take the time to run throughput tests before you go to production. The results may surprise you.

Thanks for reading

Paul Culmsee

www.sevensigma.com.au



