Back to Cleverworkarounds mainpage
 

Jan 22 2013

Confession of a (post) SharePoint architect… What are you polishing?

Send to Kindle

Hi and welcome to the latest exciting instalment in my epic series of posts on my confessions of a post SharePoint architect. I was motivated to write this series because the mild mannered shrinking violet known as Bjorn Furuknap wrote an insightful series of articles on what it takes to be a SharePoint professional. I had always planned to write on this topic as well and opted to frame it as SharePoint “confessions” because some of my approaches do not always seem mainstream (but work!) so it feels like I am confessing my sins for using them. I chose to use the word “(post)” because SharePoint is not my fulltime gig anymore. I am very lucky to do a lot of non IT work, helping organisations deal with highly complex problems. This side of my work is where most of my insights have come from and what inspired the Heretics Guide to Best Practices book.

Thus far, we have traversed a fair bit of territory via the use of f-laws – home truths about successful SharePoint delivery that focuses on areas often overlooked for various reasons. In case this is your first visit to this series, we have covered 6 f-laws so far and I strongly suggest that you read them first…

In this post, we are continuing with f-law 6, focusing on aspects to SharePoint delivery where geeks have a habit of being crap…

No matter how much you polish it…

In Australia, there is an old saying, that no matter how much you try and polish a turd, it will always be a turd. In the last post, I more or less stated that some geeks have a tendency to polish turds. They do this because of a combination of an inflated view of their self-importance, mental scars from ghosts of disaster recoveries past, and a bias toward something I termed dial tone governance.

Dial tone governance refers to all of the stuff that ensures that the SharePoint platform remains pristine, consistently reliable and high performing. I noted in the previous post that this echoes what quality assurance aspires to do. This type of IT assurance for SharePoint is completely necessary, but it is definitely not sufficient. If it was, lavish praise would be heaped upon us heroic geeks for consistent fantastic SharePoint delivery.

In the last post I also channelled Neo from the Matrix and suggested that being a hero like Neo is a thankless job since, for many stakeholders, the assurance of dial tone is assumed to be there. Whether this is right or wrong is not the point, because geeks do not survive their own hypocrisy on this matter. After all, no one thanks the telephone company for providing them with dial tone when they pick up the phone to make a call – they just get pissed when it is not there.

Now while I can sympathise with unloved telephone company engineers, they actually have it easy because once they provide dial tone, their job is done. This also applies to tools like Microsoft Exchange, Virtualisation, IP networks and storage. Unfortunately with SharePoint, successful delivery is not judged on whether the level of dial tone is appropriate. At the end of the day no amount of turd polishing via awesome support, consistent process or fast response time will make a crap solution anything other than a crap solution.

So this raises a couple of questions that readers should consider:

  1. Am I focusing too much (or too little) on dial tone governance?
  2. What are the other governance areas that I need to focus on?

As it happens, I have some data we can use to answer them.

The hardest thing…

In 2009 I created my SharePoint Governance and Information Architecture class. The class is attended by a wide variety of roles, from BAs to PMs, CIOs as well as developers and tech dudes. It has been delivered around the world with students representing just about every industry sector (including Microsoft employees). This combination of varied audience, varied industry sectors and geographic location has provided a lot of insight, because at the start of every class, I ask students to introduce themselves, and tell the class what they feel is the hardest thing about SharePoint delivery and I dialogue map the answers.

Can you see the logic of this question? By listing all of the areas that is hard about SharePoint delivery, what should emerge are the areas we should be focusing on. Why? Well the hard bits are likely to be the areas of most risk when it comes to a failed or stressed deployment.

So let’s go through the answers given to me from a few SPGovIA classes. Maybe there are some consistent patterns that emerge. It will also be interesting to see how much of it is dial tone governance.

Brisbane 2012 and Melbourne 2010

First up, here are the answers I captured from a small class in Brisbane in 2012:

  • Explaining what SharePoint is
  • User uptake (“People do not like new things”)
  • Managing proliferation of SharePoint sites
  • Too much IT ownership (“Sick of IT people telling me that SP is the solution”)
  • Users don’t know what they want
  • Difficulties around SP ownership because of a lack of accountability

For me some interesting things emerge already, but before we get into detail, let’s look at a Melbourne class answering this question two years earlier and see if any consistent patterns.

  • Every project is “new” (“Traditional ASP.NET web site development is ‘same old same old’)
  • In SharePoint you can do things in many ways so the initial design takes longer
  • The solution is never the same as the initial design and the end client may not realise this. The implication is gaps between expectation and delivery
  • Stakeholders don’t know what they want (“First time around what they sign off on is not what they want “)
  • Projects launched as “IT projects” with no clear deliverable and no success indicators
  • Lack of visibility as to what other organisations are doing
  • Determining limits and boundaries (“Doing anything ‘practically’ in SharePoint is hard”).
    • For example: We improved Ux in certain areas, but to extend to the entire feature set would take too long”
  • Managing expectations around SharePoint.
    • Clients with no experience think it can do everything
    • Difficulties getting information from and translating into design, so it can be implemented
  • Legacy of bad implementations makes it hard to win the business owner
  • Lack of governance
    • Viral spread of unmanaged sites
    • No proper requirements of “why”
    • No-one managing it

Some analysis…

The first thing that I notice is that if you go back to the start of this post and review the six f-laws, four are clearly represented here. We have stakeholders not knowing what they want which makes design hard (f-law 2), the gap between expectation and delivery (f-law 5), the problem of SharePoint projects being done as “IT projects” (f-law 6) with “no clear deliverables and no success indicators” (f-law 3). Other themes include lack of accountability and managing viral growth of sites, but the overwhelming theme that comes through for me is that of managing user expectations and buy-in.

A telling part about what is listed is that aside from the ever present issue of managing site sprawl, not too much of it is dial tone at this point. To see if this pattern continues, let’s head to Auckland New Zealand and see if the Kiwis are any more geeky than their Australian cousins…

Auckland 2011

  • Gap between expectations and reality
  • Accountabilities and role clarity around delivery
  • Knowledge transfer and ongoing maintenance (“Not everything is written down and when people leave, key critical information is lost. For example: Business rules set up at the start are lost over time”)
  • Helping people change practices (“Getting people to use it “)
  • Managing the growth over time (“the challenge of a large user base wanting to use it in different ways”)
  • It’s a big, complex product
  • The perception of “mystique” around SharePoint (“hard to know what not to do”)
  • Seen as another “IT service”
    • product selected because it’s Microsoft
    • the people who chose the product/delivering the product are IT
  • Translating the capabilities of the product to the needs of the user
    • Getting the business to understand SharePoint’s capability
    • Restrictions vs freedom
  • Ramp up time: The learning curve across all roles (tech and non tech)
  • The challenge of user requirements: Knowing the right questions to ask

Some more analysis…

It is clear that the themes that emerged from the two classes in Australia are also consistent here. The issue of stakeholder expectations came up straight away as well as the “IT driven project” issue (“seen as another ‘IT’ service”). Once again, the only real dial tone governance issue was the problem of managing site growth over time, but even then, it was framed more of an expectations issue (“the challenge of a large user base wanting to use it in different ways”). F-law 4 also copped a mention in terms of knowing the right questions to ask to get the right user requirements.

The additional themes that I noted from this group were:

  • complexity (“It’s a big, complex product“)
  • change management (“helping people change practices”)
  • the high learning curve of SharePoint for users
  • knowledge transfer over time the challenge of a large user base wanting to use the product in different ways.

<geek alert>Now if you are reading this and you manage complex infrastructure, let me assure you that tech people were in the classes</geek alert>. Also, since Australia and New Zealand are culturally quite similar to each other, it could be argued that we are not taking into account different cultures. So let’s find out what a 2012 class in Singapore had to say…

Singapore 2012

  • Trying to deal with the sheer number of features
  • “A totally different kind of concept”
    • A little knowledge can be dangerous
    • If you start with the wrong footing, you end up messing it up
  • Trying to deal with “I need SharePoint”
  • SharePoint for an external web site was difficult to use. Unfriendly structure for a public facing website
  • Trying to get users to use it (Steep learning curve for users)
  • The need for “deep discussion” to ensure SharePoint is put in for the right reasons. Without this, the result is messy, disorganised portals
  • The gap between the business and IT results in a sub optimal deployment
  • Demonstrating value to the business (SharePoint installed, but its potential is not being realized)
  • Stakeholders not appreciating the implication of product versus platform
  • You are working across the entire business (The disconnect between management/coalface)
  • “Everything hurts with SharePoint”
  • Facilitating the discussion at the business level is hard when your background is IT

Final Analysis

Once again the answers provided by Singapore attendees is extraordinarily consistent with the other three classes we looked at. User expectations and adoption were at the forefront, complexity was there, as was the business/IT disconnect as well as demonstrating business value. The theme of platitudes (f-law 4) and confusing the means from the ends (f-law 1) was apparent with the comment about dealing with the “I need SharePoint” issue.

I also note that the Singapore group seemed to have a greater recognition of their weaknesses – particularly with SharePoint as a “totally different type of concept” quote and last comment about difficulties facilitating discussion “when your background is IT”. I also noted one potential dial-tone comment about the difficulty of deploying SharePoint as a public facing website. Another emergent complexity related theme here is the perennial problem of SharePoint’s ample supply of features (and caveats) which risks an inappropriate up-front design decision that has negative consequences later (“Trying to deal with the sheer number of features,“ “A little knowledge can be dangerous” & “If you start with the wrong footing, you end up messing it up.”)

Finally, I particularly liked the comment about the “need for “deep discussion” to ensure SharePoint is put in for the right reasons” – that one was made by one of the Microsofties who attended the class.

Conclusions and takeaways

My own conclusion from this examination is that the responses from class attendees illustrate that dial tone governance (which is best termed as IT assurance) is necessary, but certainly not sufficient. The focus on IT assurance is a reflection of the lens that IT looks through. After all, when your performance is judged on keeping things running smoothly and reliably, it makes sense that you will focus on this.

But as illustrated by the responses, it seems that IT assurance isn’t all that hard. If it was, then why didn’t dial tone topics come up with more frequency in the responses?

So IT people, always remember f-law 1. The word ‘govern’ means to steer. We aim to steer the energy and resources available for the greatest benefit to all. Assurance on the other hand provides confidence in a product’s suitability for its intended purpose. It is a set of activities intended to ensure that customer requirements are satisfied in a systematic, reliable fashion. (I didn’t make that up by the way – that is how the ISO9000 family of standards for quality management described assurance).

The key takeaway is that to be effective and successful you actually need to apply both governance and assurance. You cannot have one without the other. Whether you have the balance right between dial tone and all the other stuff is for you to decide. So rather than focus on the stuff you already know well, perhaps it is worth asking yourself what you find hard and focus there as well.

 

Thanks for reading

 

Paul Culmsee

www.hereticsguidebooks.com

 Digg  Facebook  StumbleUpon  Technorati  Deli.cio.us  Slashdot  Twitter  Sphinn  Mixx  Google  DZone 

No Tags

Send to Kindle


Dec 30 2012

Confessions of a (post) SharePoint architect: The dangers of dial tone governance…

Send to Kindle

Hi all and welcome to the next exciting instalment of my confessions from my work as a SharePoint architect and beyond. This is the eighth post and my last for 2012, so I will get straight into it.

To recap, along the way we have examined 5 f-laws and learned that:

Now, as a preamble to today’s mini-rant, I need to ‘fess up. I know this might come as a shock, but there was once a time when I was not the sweet, kind hearted, gentle soul who pens these articles. In my younger days, I used to judge my self-worth on my level of technical knowledge. As a result of this, I knew my stuff, but was completely oblivious to how much of a pain in the ass I was to everyone but geeks who judged themselves similarly. Met anyone like that in IT? Smile

This brings me onto my next SharePoint governance f-law – one that highlights a common blind spot that many IT people have in their approach to SharePoint governance.

F-Law 6: Geeks are far less important than they think they are

All disciplines and organisational departments have a particular slant on reality that is based on them at the centre of that reality. If this was not true, then departments would not spend so much time bitching about other departments and I would have no Dialogue Mapping work. The IT department is no better or worse in this regard than any other department, except that the effects of their particular slant of reality can be more pronounced and far reaching on everyone else. Why? Because the IT slant of reality sometimes looks like a version of Neo from the Matrix. Many, if not most people in IT, have a little Neo inside of them.

image

We all know Neo – an uber hero. He is wise, blessed with super powers, can manipulate your very reality and is a master of all domains of knowledge. Neo is also your last hope because if he goes down, we all go down. Therefore, everything Neo does – no matter how over the top or what the consequences are – is necessary to save the world from evil.

All of the little Neos in IT have a few things in common with bullet stopping big Neo above. Firstly, little Neo has also been entrusted with ensuring that the environment is safe from the forces of evil. Secondly, Little Neo can manipulate the reality that everybody else experiences. And finally, little Neo is often the last hope when things go bad. But that is where the similarities end because big Neo has two massive advantages over little Neo. First, big Neo was a master of a lot of domains of knowledge because he had the convenience of being able to learn any new subject by downloading it into his brain. Little Neo does not have this convenience, yet many little Neos still think they are all-knowing and wise. Secondly, big Neo was never mentally scarred from a really bad tequila bender…

Bad tequila bender? What the…

Never again…

Years ago when I was young and dumb, I was at a party drinking some tequila using the lemon and salt method. My brother-in-law thought it would be hilarious to switch my tequila shots with vodka double shots. Unfortunately for me, I didn’t notice because the lemon and salt masked the taste. I downed a heap of vodkas and the net result for me was not pretty at all. Although I wasn’t quite as unfortunate as the guy in the picture below I wasn’t that far off. As a result, to this day I cannot bring myself to drink tequila or vodka and the smell of it makes me feel sick with painful memories best left supressed.

image

I’m sure many readers can relate to a story like this. Most people have had a similar experience from an alcohol bender, eating a dodgy oyster or accidentally drinking tap water in a place like Bali. So take a moment to reflect on your absolute worst experience until you feel clammy and queasy. Feeling nauseous? Well guess what – there is something even worse…

Anyone who has ever worked in a system administrator role or any sort of infrastructure support will know the feeling of utter dread when the after hours pager goes off, alerting you some sort of problem with the IT infrastructure on which the business depends. Like many, I have lived through disaster recovery scenarios and let me tell you – they are not fun. Everyone turns to little Neo to save the day. It is high pressure and stressful trying to get things back on track, with your boss breathing down your neck, knowing that the entire company is severely degraded until you to get things back online.

Now while that is bad enough, the absolute nightmare scenario for every little Neo in IT is having to pick up the pieces of something not of their doing in the first place. In other words, somehow a non-production system morphed into production and nobody bothered to tell Little Neo. In this situation, not only is there the pressure to get things back as quickly as possible, but Little Neo has no background knowledge of the system being recovered, has no documentation on what to do, never backed it up properly and yet the business expects it back pronto.

So what do you expect will happen in the aftermath of a situation like the one I described above? Like my aversion to tequila, Little Neo will develop a pathological desire to avoid reliving that sort of pain and stress. It will be an all-consuming focus, overriding or trivialising other considerations. Governance for little Neo is all about avoiding risk and just like Big Neo, any actions – no matter how over the top or what the consequences are – will be deemed as necessary to ensure that risk is mitigated. Consequently, a common characteristic of lots of little Neos is the classic conservative IT department who defaults to “No” to pretty much any question that involves introducing something new. Accordingly, governance material will abound with service delivery aspects such as lovingly documented physical and logical architecture, performance testing regimes, defining universal site templates, defining security permissions/policies, allowed columns, content types and branding/styling standards.

Now all of this is nice and needs to be done. But there is a teeny problem. This quest to reduce risk has the opposite effect. It actually increases it because little Neo’s notion of governance is just one piece of the puzzle. It is the “dial tone” of SharePoint governance.

The thing about dial tone…

What is the first thing you hear when you pick up the phone to make a call? The answer of course is dial tone.

Years ago, Ruven Gotz asked me if I had ever picked up the phone, heard dial tone and thought “Ah, dial tone… Those engineers down at the phone company are doing a great job. I ought to bake them a  cake to thank them.” Of course, my answer was “No” and if anyone ever answered “Yes” then I suspect they have issues.

This highlights an oft-overlooked issue that afflicts all Neos. Being a hero is a thankless job. The reality is that the vast majority of the world could not care less that there is dial tone because it is expected to be there – a minimum condition of satisfaction that underpins everything else. In fact, the only time they notice dial tone is when it’s not there.

Yet, when you look at the vast majority of SharePoint governance material online, it could easily be described as “dial tone governance.” It places the majority of focus on the dial tone (service delivery) aspects of SharePoint and as a result, de-emphasises much more important factors of governance. Little Neo, unfortunately, has a governance bias that is skewed towards dial tone.

Keen eyed readers might be thinking that dial tone governance is more along the lines of what quality assurance is trying to do. I agree. Remember in part 2 of this series, I explained that the word ‘govern’ means to steer. We aim to steer the energy and resources available for the greatest benefit to all. Assurance, according to the ISO9000 family of standards for quality management, provides confidence in a product’s suitability for its intended purpose. It is a set of activities intended to ensure that requirements are satisfied in a systematic, reliable fashion. Dial tone governance is all about assurance, but the key word for me in the previous sentence is “intended purpose.”

Dial tone governance is silent on “intended purpose” which provides opportunities for platitudes to fetser and governance becoming a self fulfilling prophecy.

and finally for 2012…

So, all of this leads to a really important question. If most people do not care about dial tone governance, then what do they care about?

As it happens, I’m in a reasonable position to be able to answer that question as I’ve had around 200 people around the world do it for me. This is because in my SharePoint Governance and Information Architecture Class, the first question I ask participants is “What is the hardest thing about SharePoint delivery?”

The question makes a lot of sense when you consider that the hardest bits of SharePoint usually translate to the highest risk areas for SharePoint. Accordingly, governance efforts should be focused in those areas. So in the next post in this series, I will take you through all the answers I have received to this question. This is made easier because I dialogue mapped the discussions, so I have built up a nice corpus of knowledge that we can go through and unpack the key issues. What is interesting about the answers is that no matter where I go, or whatever the version of SharePoint, the answers I get have remained extremely consistent over the years I have run the class.

Thanks for reading…

 

Paul Culmsee

p.s I am on vacation for all of January 2013 so you will not be getting the next post till early Feb

 Digg  Facebook  StumbleUpon  Technorati  Deli.cio.us  Slashdot  Twitter  Sphinn  Mixx  Google  DZone 

No Tags

Send to Kindle


Dec 10 2012

Confessions of a (post) SharePoint Architect: A pink box called chaos…

Send to Kindle

Hi all and welcome back to my ever growing series which attempts to codify a lot of learning over a long period of time into something that I hope is readable, rigorous and is useful to anyone tasked with successful SharePoint project delivery. This is the 7th post in a what is turning out to be a large series. It is large for a good reason: SharePoint is complex and the problems it attempts to solve (collaboration) are complex as well. If this is your first article, I super-strongly suggest you take it from the beginning as these articles build on each-other.

My motivation for getting this stuff out there is to tap into the shift I now sense in how organisations approach the SharePoint platform. Through a combination of organisations living through SharePoint project failure, practitioners experiencing SharePoint fatigue syndrome, as well as the strength and congruence of the messages of many wise people in the SharePoint community, organisations now realise that SharePoint is a “different” kind of project.  This realisation represents the first stage of any form of learning where people have moved from unconscious incompetence to conscious incompetence.  These terms sound rather confronting but are common in training-speak…

  • Unconscious incompetence refers to people who do not realise they lack certain skills and therefore don’t realise they need training to address the gap. That explains pretty much every IT centric SharePoint deployment based on the “built it and they will come” delivery model (and we all know how much fun those projects are).
  • Conscious incompetence means that people now see the gap between their knowledge and know they need help. Much of my company’s work is in this area – as is our close friends at Dynamic Owl and 21apps. Aside from our own clients, all of us are often sought out by other Microsoft partners who have learnt the hard way that the classic model of sales guy, project manager, business analyst and developer doesn’t always crack the SharePoint nut.

So newfound realisation among clients and consultants is there and that’s all well and good. Now the issue is to get past the cover story and take it to the next level. We have to go beyond the oft-used hippie clichés like “It’s all about the business, man” and make the art of SharePoint delivery real, pragmatic, rigorous and tangible. This series aims to be my attempt to do just that, complimenting leaders like Andrew Woodward, Ruven Gotz, Michal Pisarek and Sue Hanley. To that end, as I continue writing the series I hope that you:

  • laugh at the various truths behind the various SharePoint governance f-laws;
  • smile knowingly at the folly of some of the elaborate project and governance rituals you have to do now;
  • have your own biases challenged as you either cringe in embarrassment or think “he has gone too far with *that* comment”; and
  • have enough solid ammo to get through to influence other key stakeholders in your organisation

Allrighty then! Let’s get down to business. Our f-law for today’s article comes from Woody Allen. I have never actually watched one of his movies, but I have to give him credit for this pearl of wisdom…

F-Law 5: Confidence is the feeling you have until you understand the problem

Most projects start with a honeymoon phase. A newly formed team gets to deliver some new technology that is high profile and bolster their CV’s while “taking the business to the next level” (platitude black belts take note!) Morale is high and the team feel the sort of excitement one feels when going on a first date. Like a bad first date however, it doesn’t take long for the slow but relentless imposition of reality to take hold. Accordingly, as understanding of the problem grows, uncertainty grows commensurately. This in turn tests the initial project assumptions which an optimistic budget was likely pinned to. Most people can’t handle this sort of uncertainty because it confers risk of blame – something we all seek to avoid if we can. Thus on a complex project where the problem has elements of wickedness, blame avoidance results in things becoming quite dysfunctional and often project teams lose confidence that they can solve the problem.

There is an underlying phenomenon at work here that seems to be part of the human condition. Check out these two diagrams below as both of them show the same pattern. The left image is by Gartner and is their famous hype cycle that they pin technology fads to. The other won a Nobel prize for the originators and refers to a phenomenon called the Dunning-Kruger effect.

Gartner_Hype_Cycle_svg    Snapshot

The diagrams should be self explanatory as both are a representation of increasing ones understanding of a problem over time. Both diagrams powerfully illustrate that as understanding grows, one never regains the same level of confidence that one had at the start! Take a look at the red line which reflects level of understanding of a problem. In both cases, the red line never reaches the same peak as it did at the very start when confidence was high and understanding was low. In other words, as your understanding grows and you become more informed about a problem, you will never be as certain as you would like to be.

Now in my mind, anyone who tries to argue against truth of the above patterns have fallen victim to the pattern. Furthermore, if you are dealing with someone who fits that level of optimistic naivety like a command-and-control project manager or CIO, just tell them that this has Nobel prize winning backing. For those CIO types who get all of their gospel from Gartner, use the hype cycle instead. After all, what would those Nobel dudes know eh? :-)

So here is a tip. Next time you are kicking off a SharePoint project and need to assess risk, try this: First up, explain platitudes as described in the last two posts. Then draw one of the above diagrams on a whiteboard and ask your stakeholders to place an X on the above diagrams where they see the project team, themselves and where they see others! I guarantee much fun and frivolity will ensue…

Divergence and Convergence

Now if you work for an organisation where the idea of ranking ones naivety is a bit confronting, let me offer you something gentler. This alternate way of looking uncertainty over time is similarly powerful to the images above. I first saw this diagram used by Jeff Conklin, who got it from a book by Sam Kaner called the Facilitator’s Guide to Participatory Decision-Making. My diagram below was modified from Kaner for my own purposes.

image

Like the Gartner and Dunning-Kruger diagrams above, the X axis represents time, with a problem at one end and the solution on the other. The Y-axis represents the level of uncertainty and it illustrates that project teams typically go through a cycle of divergent thinking, followed by a period of convergent thinking as the team becomes more aligned and the problem is better understood. What is interesting to note is that the point where divergence ends and convergence starts is never clear. No-one ever stops and pronounces “Okay guys, I think we have sufficiently diverged. Let’s converge now.”

To converge, one has to cross over the ‘hump’ of divergence. Imagine climbing a mountain and there are thick clouds that obscure the peak. For all you know, the peak might be a couple hundred feet further, but equally so, you might only be halfway up. For this reason I draw a box in the middle rather than connect the arrows. it is important to note that when I draw the box in the middle, I make a point of asking people to tell me the sort of words they would use to convey what they are feeling during this time. Without fail, I always get negative responses like “confusion,” “irritability,” “stress” and “uncertainty.”

Now consider this: Some projects tend to diverge sharply and convergence seems almost impossible. No attempts to reign things in by asserting control seem to work. In fact, they usually make things worse. Accordingly, SharePoint projects commonly look like the figure below. They are highly divergent with little convergence as a result of the varied implicit assumptions that stakeholders have about SharePoint that have not been aired and reconciled. The power of those pesky platitudes, eh?

image

When I show this version of the diagram to people, I always ask a simple question:

  • If you do not have governance for SharePoint, what do you have?

The answer I get is always “Chaos,” which I write in the box as you can see above. My next question to the group is this:

  • “So by definition, to understand SharePoint governance, we all have to metaphorically open this box we have labelled “Chaos” to understand the forces that create the divergence?

So far, nobody has disagreed with that logic. So I then I hit people with the punch line…

  • “So how can you tell me that your governance approach is addressing the forces of divergence if you don’t know what’s in the box?“

That is usually the great silence moment… Despite the logic of my argument, most organisations never open the damn box and then approach SharePoint project delivery in a manner that is very likely to exacerbate the problem, rather than address it.

False convergence…

Check out the figure below. It is a variation of the divergence/convergence figure and represents a common approach to rein in SharePoint going haywire. If you look closely, you can see that an attempt has been made to force convergence. This manifests in different ways, but is most often the scenario when a sponsor or key stakeholder starts to make key decisions on behalf of others. In the short term, this approach tends to feel good because there is a sense of momentum and something is “being done” to get things under control. Project managers stop palpitating for a time because their Gantt charts start to see some progress…

image

After explaining this diagram, I then ask my audience. “So has this dealt with divergence?”. To this day, not a single person I have ever asked has said yes. In fact, everyone implicitly seems to realise that this is a false convergence and the underlying divergent forces have not truly been addressed at all. There is usually a short term feeling that things are getting back on track, but it doesn’t last long because it is actually little better than an illusion and things starts to get fishy – both on the project and in my diagram as shown below…

image

So eventually we will be smacked by the chaos baseball bat whether we like it or not. Despite this, many organisations will persist with the forced convergence approach many times (with a different set of consultants each time) and of course continue to get crappy results. Eventually though, the attrition of this approach will exceed the commitment of stakeholders and something gives. It is at this point where some organisations react by doing another dumbass thing…

Overly constraining divergence…

Once there is a realisation that an elite coterie (like the IT department or a single champion in the business) cannot solve the information management problems for the entire organisation via false convergence approaches, the next approach seems to artificially constrain the divergence via controlling the terms of reference – aka lock the crap out of the scope. This is seductively tempting since scope creep is the quintessential symptom of stakeholders who are still in divergent thinking.

While I have no problem with determining an appropriate scope, as we all operate within time, people and budgetary limits. But it has to be done for the right reasons and constraining divergence is not a good reason. Why? Because it means that stakeholders have very little shared understanding at the point scope is decided. This is a problem because ideally, divergent thinking should be reconciled to be able to decide on scope. From a diagrammatic point of view, this is like putting clamps around the level of allowable uncertainty. The classic example of this approach is when an organisation opts for the installation and deployment of SharePoint itself as phase 1 of the project. Ever done that before? Smile

image

Like the previous example, I ask my audience whether this approach deals with divergence. Also like the first example, the answer is a universal no. The underlying divergent forces have not truly been addressed at all and things are still fishy – although on the diagram below it is a difference species of fish! :-) In fact what you are doing here is penalising people for their learning – something I warned against doing in part three.

image

Key takeaways

I hope that you find these diagrams useful when discussing SharePoint delivery with your own stakeholders. By explaining SharePoint delivery in terms of divergence, convergence and a pink box labelled “chaos” we are able to provide a frame to show why artificially constraining divergence often has the opposite effect of what is intended. It is also worth pointing out that both of the above approaches are not particularly collaborative either – which tends to go against what one is trying to do with SharePoint in the first place.

Many SharePoint projects proceed on the assumption that the problem is well understood – that divergence has peaked and we are heading down the slope of convergence. If this is indeed the case, then the SharePoint project should go reasonably well since all of the tools for managing and delivering projects are convergence tools and do a good job in assisting this process. When this is not the case however, those same approaches have a bad habit of getting in the way by precluding the sort of learning and exploration needed for stakeholders to align around a problem. Returning to my mountain climbing metaphor I used earlier, these tools are like gravity assist to help you get down the mountain, but they weight a lot when you are trudging up a steep mountain and when clouds are obscuring the peak.

My takeaway for f-law 5 is not to jump straight to convergence. It might give you an initial sense of certainty and momentum, but only for a short time. I have said it many times and I will say it again. While there is a lack of shared understanding among participants of a problem, you will never get the shared commitment you need to see a solution through. Shared commitment is critical because without it, projects lose their energy and momentum to be seen to conclusion. Persistent divergence is a sign of a lack of shared understanding so the trick of course, is to harness divergence and turn it into something positive. Create the conditions that allows for some uncertainty, reduce the blame culture and tolerate mistakes. Invest in tools and methods that allow collective sensemaking and give people safety and structure to raise and reconcile their concerns.

Achieving shared understanding of the problem is for me the essence of SharePoint governance. In the doctors vs. midwives post in this series I explained how the goal of an architect is to create the right conditions for SharePoint success. The conditions to manage and harness divergence is a critical skill.

If you can achieve this end you should be bloody proud of yourself – as you have done 80% of the work of SharePoint delivery already.

Thanks for reading

Paul Culmsee

www.hereticsguidebooks.com

 Digg  Facebook  StumbleUpon  Technorati  Deli.cio.us  Slashdot  Twitter  Sphinn  Mixx  Google  DZone 

No Tags

Send to Kindle


Dec 02 2012

Confessions of a (post) SharePoint architect: Black belt platitude kung-fu

Send to Kindle

Hello kung-fu students and thanks for dropping by to complete your platitude training. If you have been dutifully following the prior 5 articles so far in this series, you will have now earned your yellow belt in platitude kung-fu and should be able to spot a platitude a mile away. Of course, yellow belt is entry level – like what a Padwan is to a Jedi. In this post, you can earn your black-belt by delving further into the mystic arts of the (post) SharePoint architect and develop simple but effective methods to neutralise the hidden danger of platitudes on SharePoint projects.

If this is your first time reading this series, then stop now! Go back and (ideally) read the other articles that have led to here. Now in reality I know full well that you will not actually do that so read the previous post before proceeding. Of course, I know you will not do that either, so therefore I need to fill you in a little. This series of articles outline much of what I have learnt about successful SharePoint delivery, strongly influenced from my career in sensemaking. I have been using Russell Ackoff’s concept of f-laws – truth bombs about the way people behave in organisations – to outline all of the common mistakes and issues that plague organisations trying to deliver great SharePoint outcomes.

So far in this series we have explored four f-laws, namely:

In the last post, we took a look at the danger of conflating a superlative (like biggest, best, improved and efficient) with a buzzword like (search, portal, collaboration, social). The minute you combine these and dupe yourself into thinking that you now have a goal, you will find that your project starts to become become complex, which in turn results in over-engineered solutions solving everything and anything, and finally your project will eventually collapse under its own weight after consuming far too many financial (and emotional) resources.

This is because the goal you are chasing looks seductively simple, but ultimately is an illusion. All of your stakeholders might use the same words, but have very different interpretations of what the goal actually looks like to them. The diagram that shows the problem with this is below. On the left is the mirage and to the right is the reality behind the mirage. Essentially your fuzzy goal actually is a proxy for a whole heap of unaligned and often unarticulated goals from all of your stakeholders.

Snapshot   Snapshot

Now in theory, you have read the last post and now have a newly calibrated platitude radar. You will sit at a table and hear platitudes come in thick and fast because you will be using Ackoff’s approach of inverting a goal and seeing if a) the opposite makes any logical sense and b) could be measured in any meaningful way. As an example, here are three real-world strategic objectives that I have seen adorning some wordy strategic plans. All three set off my platitude radar big time…

“Collaboration will be encouraged”

“A best-practice collaboration platform”

“It’s a SharePoint project” Smile

I look at the first statement and think “so… would you discourage collaboration? Of course not.” Ackoff would take a statement like that and say “Stop telling me what you need to do to survive, and tell me what you need to do to thrive”.

What do you mean by?

So if I asked you how to unpack a platitude into reality, what might you do?

For many, it might seem logical to ask people what they really mean by the platitude. It might seem equally logical to come up with a universal definition to bring people to a common understanding of the platitude. Unfortunately, both are about as productive as a well-meaning Business Analyst asking users “So, what are your requirements?”

With the “what do you mean by [insert platitude here]” question, the person likely won’t be able to articulate what they mean particularly well. That is precisely why they are unconsciously using the platitude in the first place! Remember that a platitude is a mental shortcut that we often make because it saves us the cognitive effort of making sense of something. This might sound strange that we would do this, but in the rush to get things done in organisations, it is unsurprising. How often do you feel a sense of guilt when you are reflecting on something because it doesn’t feel like progress? Put a whole bunch of people feeling that way into a meeting room and of course people will latch into a platitude.

By the way, the “mental shortcut” that makes a platitude feel good seems to be a part of being human and sometimes it can work for us. When it works, it is called a heuristic, When it doesn’t its called a cognitive bias. Consult chapter 2 in my book for more information on this.

Okay, so asking what someone means by their platitude has obvious issues. Thus, it might seem logical that we should develop a universal definition for everyone to fall in behind. If we can all go with that then we would have less diversity in viewpoints. Unfortunately this has its issues too – only they are a little more subtle. As we discovered in part 2 of this series appropriately entitled “don’t define governance”, definitions tend to have a limited shelf life. Additionally, like best-practice standards, there are always lots of them to choose from and they actually have an affect of blinding people to what really matters.

So is there a better way?

It’s all in the question and its framing…

If there is one thing I have learnt above all else, is that project teams often do not ask the right question of themselves. Yet asking the right question is one of the most critical aspects to helping organisations solve their problems. The right question has the ability to cast the problem in a completely different light and change the cognitive process that people are using when answering it. In other words, the old saying is true: ask a silly question, get a silly answer.

Let me give you a real life example: Chris Tomich is a co-owner of Seven Sigma and was working with some stakeholders to understand the rationale for how content had been structured in a knowledge management portal. Chris is a dialogue mapper like me – and he’s extremely good at it. One thing Dialogue Mapping teaches you is to recognise different question types and listen for hidden questions. The breakthrough question in this case when he got some face time with a key stakeholder and asked:

  • What was your intent when you designed this structure for your content?

The answer he got?

  • “Well, we only did it that way because search was so useless”
  • “So if I am hearing you, you are saying that if search was up to scratch you would not have done it that way”
  • “Definitely not”

Neat huh? By asking a question that took the stakeholder back to the original outcome sought for taking a certain course of action, we learnt that poor search was such a constraint they compensated by altering page template design. Up until that point, the organisation itself did not realise how much of an impact a crappy search experience had made. So guess where Chris focused most of his time?

In a similar fashion, my platitude defeater question is this:

So if we had [insert platitude here], how would things be different to now?

Can you see the difference in framing compared to “what do you mean by [insert platitude here]?”. Like Chris with his “What was your intent”, we are getting people to shift from the platitude, to the difference it would make if we achieved the platitude. No definitions required in this case, and the answer you will get almost by definition has to be measurable. This is because asking what difference something would make involves a transition of some kind and people will likely answer with “increased this”  or “decreased that”.

Now be warned – a hard core middle manager might serve you up another platitude as an answer to the above question. To handle this, just ask the question again and use the new platitude instead. For example:

  • Me: Okay so if you had improved collaboration, how would things be different to now?
  • Them: We would have increased adoption
  • Me: And what difference would that make to things?

I call this the KPI question because if you keep on prodding, you will find themes start to emerge and you will get a strong sense of potential Key Performance Indicators. This doesn’t mean they are the right ones, but now people are thinking about the difference that SharePoint will make, as opposed to arguing over a definition. Trust me – its a much more productive conversation.

Now to validate that these emerging KPI’s are good ones, I ask another question, similarly framed to elicit the sort of response I am looking for…

What aspects should we consider with this initiative to [insert platitude here]?

This question is deliberately framed as neutral is possible. I am not asking for issues, opportunities or risks, but just aspects. By using the term aspects I open the question up to a wider variety of inputs. Like the KPI question above, it does not take long for themes to emerge from the resulting conversation. I call this the key focus area question, because as these themes coalesce, you will be able to ensure your emerging KPI’s link to them. You can also find gaps where there is a focus area with no KPI to cover it. As an added bonus, you often get some emergent guiding principles out of a question like this too.

The thing to note is that rather than follow up with “what are the risks?” and “what should our guiding principles be?”, I try and get participants to synthesize those from the answers I capture. I can do this because I use visual tools to collect and display collective group wisdom. In other words, rather than ask those questions directly, I get people to sort the answers into risks, opportunities and principles. This synthesis is a great way to develop a shared understanding among participants of the problem space they are tackling.

If we were unconstrained, how would we solve this problem?

This is the purpose question and is designed to find the true purpose of a project or solution to a problem. I don’t always need to use this one for SharePoint, but I certainly use it a lot in non IT projects. This question asks people to put aside all of the aspects captured by the previous question and give the ideal solution assuming that there were no constraints to worry about. The reason this question is very handy is that in exploring these “pie in the sky” solutions, people can have new insights about the present course of action. This permits consideration of aspects that would not otherwise be considered and sometimes this is just the tonic required. As an example, I vividly recall doing some strategic planning work with the environmental division of a mining company where we asked this exact question. In answering the question, the participants had a major ‘aha’ moment which in turn, altered the strategy they were undertaking significantly.

Note: If you want some homework, then check Ackoff’s notion of idealised design and the Breakthrough Thinking principle called the purpose principle. Both espouse this sort of framed question.

Sharpening the saw…

Via  the use of the above questions, you will have a  better sense of purpose, emergent focus areas and potential measures. That platitude that was causing so much wheel spinning should be starting to get more meaty and real for your stakeholders. For some scenarios, this is enough to start developing a governance structure for a solution and formulating your tactical approaches to making it happen. But often there is a need to sharpen the saw a bit and prioritise the good stuff from the chaff. Here are the sort of questions that allow you to do that:

No matter what happens, what else do we need to be aware of?

This question is called the criterial question and I learnt it when I was learning the art of Dialogue Mapping. When Dialogue Mapping you are taught to listen for the “no matter what…” preamble because it surfaces assumptions and unarticulated criteria that can be critical to the conversation and will apply to whatever the governance approach taken. Thus I will often ask this question in sessions, towards the end and it is amazing what else falls out of the conversation.

What are the things that keep you up at night?

I picked this up from reading Sue Hanley’s excellent whitepaper a while back and listening to hear speak at Share2012 in Melbourne reminded me why it is so useful. This question is very cleverly framed and is so much better than asking “What are your issues?”. It pushes the emotive buttons of stakeholders more and gets to the aspects that really matter to them at an gut level rather than purely at a rational level. (I plan to test out dialogue mapping a workshop with this as the core question sometime and will report on how it goes)

What is the intent behind [some blocker]?

This is the constraint buster question and is also one of my personal favourites. If say, someone is using a standard or process to block you with no explanation except that “we cannot do that because it violates the standards”, ask them what is the intent of the standard. When you think about it, this is like the platitude buster question above. It requires the person to tell you the difference the standard makes, rather than focus on the standard itself. As I demonstrated with my colleague Chris earlier, the intent question is also particularly useful for understanding previous context  by asking users to outline the gap between previous expectation and reality.

Conclusion…

To there you go – a black belt has been awarded. Now you should be armed with the necessary kung-fu skills required to deflect, disarm and defeat a platitude.

Of course, knowing the right questions to ask and the framing of them is one thing. Capturing the answers in an efficient way is another. For years now, I have advocated the use of visual tools like mind mapping, dialogue mapping and causal mapping tools as they all allow you to visually represent a complex problem. So as we move through this series, I will introduce some of the tools I use to augment the questions above.

Thanks for reading

 

Paul Culmsee

 Digg  Facebook  StumbleUpon  Technorati  Deli.cio.us  Slashdot  Twitter  Sphinn  Mixx  Google  DZone 

No Tags

Send to Kindle


Nov 13 2012

Confessions of a (post) SharePoint architect: Yellow belt platitude kung-fu

Send to Kindle

Hi all. It’s been a while I know, but it is time to continue along my journey of confessions of a post SharePoint architect. As I write this, the SharePoint community is in Vegas, soaking up the love of the biggest SharePoint conference yet. For the other twelve SharePoint professionals who are not there, I know your pain.

If this is your first  foray into this series of articles, consider it the closest I will ever get to a SharePoint governance book. Since all new knowledge is gained through the lens of existing knowledge, it is important to note that my world view has been shaped by the increasing amount of non IT work I do in various complex problem solving areas. Essentially this work has had a major effect on the lens that I view SharePoint projects and the approaches I use to steer them. When developing a class on the topic of SharePoint Governance and Information Architecture, I found a fun yet effective way to put coherence around things via Russell Ackoff’s concept of f-laws. These are simple home truths about the way people behave in organisations than explain much more than the complex ones proposed by theorists of various persuasions.

So far in this series we have explored three f-laws, namely:

The next f-law is straight from Ackoff himself and is a universal truth in any project, but absolutely chronic in SharePoint projects as well as many SharePoint slide decks:

F-Law 4: Most  SharePoint governance objectives are platitudes. They say nothing but hide behind words

Most people in the IT industry (with the obvious exclusion of sales guys, Office365 MVP’s and Google employees) tend to inwardly cringe or outrightly roll their eyes when the word “cloud” is uttered during conversation. This is because people instinctively know that what follows is either:

  • Gushing hyperbole squarely aimed at getting you to part with some cash
  • Gushing hyperbole squarely aimed at convincing you that they know what they are talking about
  • Gushing hyperbole in the form of FUD laden counter-arguments from server hugging sysadmins who reject “cloud” outright because they fear irrelevance

These reactions in such conversations result from the term “Cloud” being used in a platitudinal sense. In case you were not aware, a platitude is referred to as “a trite, meaningless, biased, or prosaic statement, often presented as if it were significant and original”. Platitudes are everywhere and usually unavoidable since many people use them unconsciously – especially politicians, senior executives and the aforementioned sales guys. Want proof? Just look on the wall behind the reception area of your office – chances are there is a mission statement there somewhere that would read something like:

“We are dedicated to ensuring a long-term commitment to stakeholder value from performance and improved returns at all levels.”

Does a mission like that sound familiar? What if I told you the line above was generated from a website that generates mission statements like a poker machine. Just pull the lever and within a few seconds, a random assortment of small quotes are mashed together to create a cool sounding sentence. If you enter your company name into it, you can even print a certificate.  I generated this  one for the Heretic’s Guide book. Neat, huh?

clip_image002

The problem, of course, with a platitude is that while it sounds significant, it doesn’t actually tell you much at all. So this f-law states that most SharePoint governance objectives are platitudes. One of the core reasons for this is a side effect of Microsoft’s marketing approach. Consider Microsoft’s SharePoint marketing material as it has evolved. Take a look at how many words survived the transition from Microsoft’s SharePoint pie of 2007, the Frisbee of 2010 and now the square of 2013. Do you see a pattern? What is the average shelf life of a word in each of these diagrams?

image image

Now, let me start out by stating that I have no problems with any of these diagrams. Microsoft is perfectly entitled to develop the message it wants to convey in whatever means it sees fit. The biggest travesty is when people frame the above words as deliverables. They take a superlative word like “improved”, “best practice” or “effective” and then add one of the above words to it. This combination inevitably forms the basis for the justification of putting SharePoint in.

The classic example that still pervades SharePoint projects to this day is the perennial mirage of “Improved Collaboration.” If we return to the “here” and “there” diagrams of the previous posts, it looks like this… note the aspiration goal has a happy smiley on it!

Snapshot

Platitude detection 101

So the first thing you have to do as a SharePoint architect or practitioner is to develop a finely tuned platitude radar. The thing to be aware of is that platitudes come in many forms – some which are obvious and some much more subtle. Thus we will start your platitude radar calibration via a quick and easy method that Ackoff came up with. Years ago, Russel Ackoff critically examined mission statements and said that a mission statement that merely restates the obvious does not say anything that is truly aspirational. To quote from Ackoff:

They often formulate necessities as objectives: For example, ‘to achieve sufficient profit’. This is like a person saying his mission is to breathe sufficiently.

Ackoff’s test to judge the quality of a mission statement was to inverse the statement and see it still made logical sense. If you could not reasonably disagree with this negative version, then the original statement was a platitude. As an example, consider this mission statement from a well-known global organisation:

… our mission and values are to help people and businesses throughout the world realize their full potential.

So, our inverse here is that we are working to hinder people and businesses to realise their full potential. Who in the hell would ever do that? Well – given this is Microsoft’s mission statement, suddenly Windows Vista is finally starting to make sense to me. Smile

Now, go and take any word from the 3 diagrams above and put a superlative like “biggest”, “best”, “optimum” or “improved” in front. If I use my example – “Improved Collaboration” – Ackoff’s inversion approach results in “Worse Collaboration” and is therefore a platitude. I mean – aside from the odd command and control boss, would anybody seriously want to make collaboration less effective?

So, to put it simply – stop stating the bloody obvious! If your SharePoint goal doesn’t satisfy Ackoff’s simple platitude test, you have a problem.

The seeds of doom are sown at the start…

Now that I have wired up your platitude detector via Ackoff’s inversion test, you will start to notice how utterly pervasive they are in SharePoint projects and beyond. As Kailash and I state in the Heretics Guide book:

A platitude is a mental shortcut we take, a deceptively quick way to cut through uncertainty. We clump our unclear, unarticulated aspirations in a bunch of platitudes. It is easy to do and it gives us a sense of achievement. But it is a mirage because the objective is not clear and we cannot define sensible measurements of success if the goal is fuzzy. It never fails to amaze us that many organisational endeavours are given the go-ahead on the basis of platitudinous goals. Mind-boggling, isn’t it?

What is really amazing and sad at the same time is how badly the platitude problem is misattributed. One of my students at a recent SPGovIA class said that with SharePoint projects “the seeds of doom are sown at the very beginning”. He’s right too…project teams will commit significant time and money into a project that is chasing a platitude, and when things inevitably go haywire, will blame the process, methodology, people… everything but the mirage at the root of it all.

The seduction of a platitude is strong. Many have been entranced by some nice sounding desirable future state incorporating some superlative like “Improved quality” or “Best practice collaboration”. But the key point is this: The platitude becomes a sort of proxy for the end in mind rather than the real end. We have no shared understanding of what where we want to get to. Empty words preclude a shared understanding because they mean nothing at all.

The image below illustrates the effect of a platitude being confused with the actual desired state. We do not have an aspirational future state at all. Instead, we have many possible, fuzzily-defined future states.

Snapshot

If you look really closely, the future state is a sad smiley. This is because the visible symptoms of a project with a divergent understanding among participants are well documented. Scope creep and vague requirements mean that the project will start to unravel, yet the platitude-driven journey towards the mirage will continue. The project will lurch from crisis to crisis, with scope blowing out, tensions and frustrations rising. This is accompanied by classic blame-shift or hind-covering moves that people make when they realise that their ship’s taking water.

How to defeat a platitude…

I am going to conclude this post at this point because it is starting to get too long. But I will leave you with a teaser about the next post…

One of the most important things about dealing with a platitude is knowing what *not* to do. I know that what I am about to say may sound counter intuitive to many readers, but trust me when I tell you that there are two things you should avoid doing.

  1. Never, never, never ask someone what they mean when they use a platitude
  2. Never try and come up with a definition for a platitude

In the next post, I will elaborate on these two contentions and provide you with a much better way to get past the seductive danger of platitudes, so you can find out what really matters to your stakeholders.

Until then, thanks for reading…

 

Paul Culmsee

www.sevensigma.com.au

Falling Books Stack

 Digg  Facebook  StumbleUpon  Technorati  Deli.cio.us  Slashdot  Twitter  Sphinn  Mixx  Google  DZone 

No Tags

Send to Kindle


Sep 08 2012

Confessions of a (post) SharePoint Architect: The self-fulfilling governance prophecy

Send to Kindle

Hi and welcome to another SharePoint (post) architect confessional post. In case you are here via the good grace of whatever Google’s search relevance algorithm feels like doing today, I need to give you a little context to this post and the larger series of which it is a part. These days, I spend a lot of time working on projects beyond the cloistered confines of SharePoint; in fact, beyond the confines of IT altogether. Apart from being a cathartic release from SharePoint work, I’ve had the privilege to work with various groups on solving some very complex problems in a collaborative fashion. As a result of these case studies, I’ve become a bit of a student of various collaborative problem solving approaches and recently released a business book on the subject called “The Heretics Guide to Best Practices” co-written with mild-mannered mega-genius Kailash Awati. Despite (or because of) the book having no absolutely SharePoint content whatsoever, it managed to win an Axiom Business Book award and I feel it’s indirectly a good SharePoint governance book in its own right.

Now, for the rest of you who have been following my epic rant thus far, you will now be familiar with the notion of Ackoff’s f-laws: “truths about organisations that we might wish to deny or ignore – simple and more reliable guides to everyday behaviour than the complex truths proposed by scientists, economists, sociologists, politicians and philosophers.” Via the f-law metaphor, you now also understand why midwives are more valuable than doctors, the word “governance” should not be defined if you actually want people to understand it and that people should not be penalised for their learning.

The next f-law that we will explore provides an explanation to why organisations so consistently and persistently apply the wrong approaches to SharePoint-type projects. IT departments have genetic predisposition to falling into this trap, as do other service delivery departments such as Finance and HR when they put in ERP systems. To explain my assertion, we are going to revisit the governance diagram that I used in the first f-law. You can see it below:

I used the above diagram to to explain f-law 1 which was “The more comprehensive the definition of governance is, the less it will be understood by all”. The above diagram serves to point out that governance is not the end in mind, but the means by which you achieve a desirable future state. Without any context to an end in mind, we have to accommodate many vague potential ends. To deal with this uncertainty, we inevitably look to definitions to provide clarity about what governance means. Unfortunately, this form of “definitionisation” tends to confuse more than clarify because it sneakily starts to drive the end, rather than be the means. This inevitably results in over-engineered, over-complicated and likely inappropriate governance approaches that do more harm than good.

It should be noted that “governance” is by no means the only word that falls into this trap. Words like quality, effective, “best practice” and even “SharePoint” should all be put in the green star above too because all of these words have no inherent meaning until they are applied to a given situation or context. This point is echoed by people like Andrew (“SharePoint by itself knows nothing”) Woodward, Dux (“SharePoint doesn’t suck – you suck.”) Sy and Ruven (“Can I use this diagram in my Information Architecture book?”) Gotz.

To that end, our next  f-law expands on this notion of means vs. ends and provides you with a practical way to assess the clarity of a SharePoint goal or outcome.

F-Law 3: The probability of SharePoint success is inversely proportional to the time taken to come up with a measurable KPI

Hmm… f-law 3 is a mouthful isn’t it. For a start I used the acronym of “KPI”, which in case you are not aware, stands for Key Performance Indicator – something that we can measure to visibly demonstrate that we have not sucked and actually achieved what we have set out to do. In essence, this f-law states that the longer it takes to determine a reasonable and measurable indicator that SharePoint has been a success, the less likely your SharePoint project is to succeed.

To demonstrate this, I am going to give you one of my patent pending techniques that is highly useful in client engagements to get people to think a little differently about their approach. Let’s reuse my “from here to there” diagram above to perform a basic experiment. Check out the project below and tell me … what project is this?

image

Hopefully, it did not take you long to work out that this project is the Apollo moon missions. Now, for the experimental bit. Grab a stopwatch, start the clock and answer this question:

“How do you know you have succeeded with this project?”

Once you have your answer, stop the clock and note the time. I’m willing to bet that you gave one or two answers:

  • You successfully landed a person on the moon
  • You got the person back to Earth again

I am also willing to bet that you worked out that answer within 2 to 15 seconds of pondering my diagram. Am I right?

Now, consider for a moment the sheer scale of of this project in terms of size, risk, innovation and level of expertise required to land a person on the moon and bring them back safely. Imagine the sheer number of projects within multiple programs of work that had to be aligned. Imagine the tens of thousands of people who directly and indirectly worked on this epic project. It is mind boggling when you think about it and it is little wonder that putting a man on the moon is regarded one of mankind’s greatest technical achievements.

And then we have SharePoint…

Now let’s contrast the moon project with another one likely to be very familiar with readers. So once again, tell me what project this is…

image

This one takes some people a bit longer to answer, but when I ask this in workshops and conferences I sometimes get people jokingly saying “my SharePoint project!” or “a nightmare.” So once again I want you to answer the following question:

“How do you know you have succeeded with this project?”

I bet this one has you a little more stumped and is much harder to answer than the moon example above. What is funny with this one is that, when you consider that in terms of scope and size, using SharePoint to improve collaboration is a mere pimple on the butt of sending a rocket to the moon. Yet, despite the moon example being much larger in scope, cost, degree of innovation and engineering, the success criteria is clear and unambiguous to all. People can identify what success looks like very quickly. No-one will point to Venus and say “I think that’s the moon.”  You either got there or you didn’t.

Yet, when I show a SharePoint project that is framed like the above example, people have a much (much) harder time describing what success would look like. In fact, I have asked this question many times around the world and most of the answers I am offered do not hold up to any serious form of scrutiny. Consider these common suggestions of SharePoint success and my response to them:

  • “People are using it.” My response: “Yeah, but people use email and the file system now, so why are you putting SharePoint in?”
  • “People are happy.” My response: “I bet if I replaced the crappy coffee with a top of the range espresso machine I could make people really happy and it’s a fraction of the cost of SharePoint.”

Sorry folks, but this isn’t good enough… in fact it’s a recipe for a situation where, in the name of “governance,” you deliver a bloated, over-engineered failure.

When problems are complicated…

My two project examples above highlight a particular characteristic of problems that is at the root of the difference between the moon and SharePoint example. Consider the following common IT projects:

  • Replacing your old email system with Microsoft Exchange
  • Consolidating Active Directory
  • Replacing your old phone system with Voice over IP system
  • Upgrading your storage area network  to new infrastructure

All of these are like the moon example. None of them are easy – in fact you need specialist expertise to get them successfully implemented. But when you put each of these in the green star of my “here to there” picture, criteria for success is fairly clear and unambiguous. For example, if email comes in and goes out of everyone’s inboxes, Exchange is a usually success. If you can pick up the phone, get a dial tone and make a call, then the VOIP upgrade has been a success.

These are all examples of complicated problems. With complicated problems, the criteria for success is clear and unambiguous and there is a strong relationship between cause and effect. You can be highly confident that doing X will lead to Y. In these sorts of problems, experts can come together and analyse the problem by breaking the problem down neatly into its parts to develop a high-confidence solution. Furthermore, there are likely to be many best practices that have emerged from years of collective wisdom of implementing solutions because of that relationship between cause and effect.

Wouldn’t it be nice if reality was always like this. Project Managers and tech people would actually get on with each other! But of course, reality paints a different picture…

When complicated approaches fail…

In a 2002 discussion paper about reform of the Canadian health system, authors Sholom Glouberman and Brenda Zimmerman make a statement that is completely applicable to how most organisations approach SharePoint:

In simple problems like cooking by following a recipe, the recipe is essential. It is often tested to assure easy replication without the need for any particular expertise. Recipes produce standardized products and the best recipes give good results every time. Complicated problems, like sending a rocket to the moon, are different. Formulae or recipes are critical and necessary to resolve them but are often not sufficient. High levels of expertise in a variety of fields are necessary for success. Sending one rocket increases assurance that the next mission will be a success. In some critical ways, rockets are similar to each other and because of this there can be a relatively high degree of certainty of outcome. Raising a child, on the other hand, is a complex problem. Here, formulae have a much more limited application. Raising one child provides experience but no assurance of success with the next. Although expertise can contribute to the process in valuable ways, it provides neither necessary nor sufficient conditions to assure success. []

In this paper we argue that health care systems are complex, and that repairing them is a complex problem. Most attempts to intervene [] treat them as if they were merely complicated. [] We argue that many of these dilemmas can be dissolved if the system is viewed as complex.

The key point in the above quote is that the tools and approaches that work well with complicated problems actually cause a lot of trouble in complex problems, where certainty of an outcome is much less clear. My point is that while the notion of using SharePoint to get from “poor collaboration” to “improved collaboration” might seem logical on the surface, it is hard to come up with any sensible criteria for success. Therefore you are setting yourself up for a fail because you have made SharePoint take on the characteristics of complex problems. Without unpacking these implicit assumptions about “Improved Collaboration,” our aspirational future state will look like the diagram below. The reality is we have many aspirational future states, all hidden beneath the seductive veneer of “improved collaboration” that in reality tells us nothing.

image

What blows me away is that to this day most project governance material published consistently fail to realise this core issue while trying to treat the very symptoms caused by this issue!  They provide you with the tools, means and methods to chase goals which are little better than an illusion, with no means to measure progress and therefore guide the very decisions that are made in name of governance.

Without unpacking and aligning all of these different future states above, how can any SharePoint architect be sure that they are providing the right SharePoint-based enabler? If you cannot tell me the difference made by implementing a project, how can anyone else know the difference? Even if you can, how do you know that everybody else sees it the same way as you?

Is it little wonder then, that after more than a decade of trying, SharePoint projects (complex problems) continue to go haywire? While approaches to governance force a complicated lens on a complex problem and assume the goal as stated is understood by all, governance itself will be one of the root causes of poor outcomes. Why? Because governance will require people to focus in all the areas except the one that matters. When this gap in focus manifests visibly (for example SharePoint site sprawl), governance is seen as the means to address the gap. Thus governance becomes a self-fulfilling prophecy “We will get it right this time” is the mantra, all the while, we still chase those rainbows of “improved collaboration.”

Conclusion and coming next

I do not recall where I first heard the distinction between “Complicated” and “Complex” problems because I came across it some time after I discovered the term “wicked problem.” I suspect that it was the Cynefin model that first pushed my cognitive buttons on the idea, although I distinctly recall Russell Ackoff also making the distinction between complex and complicated. Irrespective of the source, I find this a hugely valuable frame of reference to examine problems and understand why SharePoint projects are routinely tackled in an inappropriate manner. With it, I have been able to give IT departments in particular, a frame of reference to understand why they have trouble with particular kinds of projects like SharePoint.

Many people in organisations do not discern the difference between a complicated and complex problem and use the tools of the “complicated problem toolkit” to address complex problems when they are inappropriate at best and will kill your project at worst. I will expand on why this happens in the next and subsequent posts. But my key takeaway is that addressing the issue of multiple interpretations of the future is not only the key SharePoint governance challenge, it is the key challenge for any complex project.

The sorts of tools and approaches that are part of the “complex problem” utility belt are numerous and are really starting to gain traction which is great. There is plenty to read on this topic elsewhere on this blog as well as people like Andrew Woodward and Ruven Gotz. The great irony is that if you do manage align people to a shared sense of what the end in mind will look like, things that might have been seen as complex will now become complicated and the traditional tools and approaches will have efficacy because outcomes are clear and the path to get there makes more sense.

In the next post and f-law, I am going to outline another chronic issue that further explains why we get suckered into chasing false goals…

 

Thanks for reading

Paul Culmsee

www.sevensigma.com.au

HGBP_Cover-236x300

 Digg  Facebook  StumbleUpon  Technorati  Deli.cio.us  Slashdot  Twitter  Sphinn  Mixx  Google  DZone 

No Tags

Send to Kindle


Aug 16 2012

Demystifying SharePoint Performance Management Part 11 – Tales from the Microsoft labs

Send to Kindle

Hi all and welcome to the final article in my series on SharePoint performance management – for now anyway. Once SharePoint 2013 goes RTM, I might revisit this topic if it makes sense to, but some other blogging topics have caught my attention.

To recap the entire journey, the scene was set in part 1 with the distinction of lead and lag indicators. In part 2, we then examined Requests per Second (RPS) and looked at its strengths and weakness as a performance metric. From there, we spent part 3 looking at how to leverage RPS via the Log Parser utility and a little PowerShell goodness. Part 4 rounded off our examination of RPS by delving deeper into utilising log parser to eke out interesting RPS related performance metrics. We also covering the very excellent SharePoint Flavored Weblog Reader utility, which saves a bunch of work and can give some terrific insights. Part 5 switched tack into the wonderful world of latency, and in particular, focused on disk latency. Part 6 then introduced the disk performance metrics of IOPS, MBPS and their relationship to latency. We also looked at typical SharePoint and SQL Server disk IO characteristics and then examined the pros and cons of RPS, IOPS, Latency, MBPS and how they all relate to each other. In part 7 and continuing into part 8, we introduced the performance monitor counters that give us insight into these counters, as well as introduced the SQLIO utility to stress test disk infrastructure. This set the scene for part 9, where we took a critical look at Microsoft’s own real-world findings to help us understand what suitable figures would be. The last post then introduced a couple of other excellent tools, namely Process Monitor and Windows Performance Analysis Toolkit that should be in your arsenal for SharePoint performance.

In this final article, we will tie up a few loose ends.

Insights from Microsoft labs testing…

In part 9 of this series, I examined Microsoft’s performance figures reported from their production SharePoint 2010 deployments. This information comes from the oft mentioned SharePoint 2010 Capacity Planning Guide. Microsoft are a large company and they have four different SharePoint farms for different collaborative scenarios. To recap, those scenarios were:

  1. Enterprise Intranet environment (also described as published intranet). In this scenario, employees view content like news, technical articles, employee profiles, documentation, and training resources. It is also the place where all search queries are performed for all of the other the SharePoint environments within the company. Performance reported for this environment was 33580 unique users per day, with an average of 172 concurrent users and a peak concurrency of 376 users.
  2. Enterprise intranet collaboration environment (also described as intranet collaboration). In this scenario, is where important team sites and publishing portals are housed. Sites created in this environment are used as communication portals, applications for business solutions, and general collaboration. Performance reported for this environment was double the first environment with 69702 unique users per day. Concurrency was more than double, with an average of 420 concurrent users and a peak concurrency of 1433 users.
  3. Departmental collaboration environment. In this scenario, employees use this environment to track projects, collaborate on documents, and share information within their department. Performance reported for this environment was a much lower figure of 9186 unique users per day (which makes sense given it is departmental stuff). Nevertheless, concurrency was similar to the enterprise intranet scenario with an average of 189 concurrent users and a peak concurrency of 322 users.
  4. Social collaboration environment. This is Microsoft’s My Sites scenario, connecting employees with one another and presenting personal information such as areas of expertise, past projects, and colleagues to the wider organization. This included personal sites and documents for collaboration. Performance reported for this environment was 69814 unique users per day, with an average of 639 concurrent users and a peak concurrency of 1186 users

Presented as a table, we have the following rankings:

Scenario Social Collaboration Enterprise Intranet Collaboration Enterprise Intranet Departmental Collaboration
Unique Users 69814 69072 33580 9186
Avg Concurrent 639 420 172 189
Peak  Concurrent 1186 1433 376 322

When you think about it, the performance information reported for these scenarios are lag indicator based. That is, they are real-world performance statistics based on a pre-existing deployment. Thus while we can utilise the above figures for some insights into estimating the performance needs of our own SharePoint environments, they lack important detail. For example: in each scenario above, while the SharePoint farm topology was specified, we have no visibility into how these environments were scaled out to meet performance needs.

Some lead indicator perspective…

Luckily for us, Microsoft did more than simply report on the performance of the above four collaboration scenarios. For two of the scenarios Microsoft created test labs and published performance results with different SharePoint farm topologies. This is really handy indeed, because it paints a much better lead indicator scenario. We get to see what bottlenecks occurred as the load on the farm was increased. We also get insight about what Microsoft did to alleviate the bottlenecks and what sort of a difference it made.

The first lab testing was based off Microsoft’s own Departmental collaboration environment (the 3rd scenario above) and is covered on pages 144-162 of the capacity planning guide. The other lab was based off the Enterprise Intranet Collaboration Environment (the 2nd scenario above) and is the focus of attention on pages 174-197. Consult the guide for full detail of the tests. This is just a quick synthesis.

Lab 1: Enterprise Intranet Collaboration Environment

In this lab, Microsoft took a subset of the data from their production environment using different hardware. They acknowledge that the test results will be affected by this, but in my view it is not a show stopper if you take a lead indicator viewpoint. Microsoft tested web server scale out initially by starting out with a 3 server topology (one web front end server, one application server and one database server). They then increased the load on the farm until they reached a saturation point. Once this happened, they added an additional web server to see what would happen. This was repeated and scaled from one Web server (1x1x1) to five Web servers (5x1x1).

The transactional mix used for this testing was based on the breakdown of transactions from the live system. Little indication of read vs. write transactions are given in the case study, but on page 152 there is a detailed breakdown of SharePoint traffic by type. While I won’t detail everything here, regular old browser traffic was the most common, representing 36% of all test traffic. WebDAV came in second (WebDAV typically includes office clients and windows explorer view) representing 28.12 of traffic and outlook sync traffic third at 7.04%.

Below is a table showing the figures where things bottlenecked. Microsoft produce many graphs in their documentation so the figures below are an approximation based on my reading of them. It is also important to note that Microsoft did not perform tests while search was running, and compensated for search overhead by defining a max CPU limit for SQL Server of 80%.

1*1*1 2*1*1 3*1*1 4*1*1 5*1*1
Max RPS 180 330 510 560 565
Sustainable RPS 115 210 305 390 380
Latency .3 .2 .35 .2 .2
IOPS 460 710 910 920 840
WFE CPU 96% 89% 89% 76% 58%
SQL CPU 17% 33% 65% 78% 79%

For what its worth, the sustainable RPS figure is based on the server not being stressed (all servers having less than 50% CPU). Looking at the above figures, several things are apparent.

  1. The environment scaled up to four Web servers before the bottleneck changed to be CPU usage on the database server
  2. Once database server CPU hit its limits, RPS on the web servers suffered. Note that RPS from 4*1*1 to 5*1*1 is negligible when SQL CPU was saturated.
  3. The addition of the fourth Web server had the least impact on scalability compared to the preceding three (RPS only increased from 510 to 560 which is much less then adding the previous web servers). This suggests the SQL bottleneck hit somewhere between 3 and 4 web servers.
  4. The average latency was almost constant throughout the whole test, unaffected by the number of Web servers and throughput. This suggests that we never hit any disk IO bottlenecks.

Once Microsoft identified the point at which Database server CPU was the bottleneck (4*1*1), they added an additional database server and then kept adding more webservers like they did previously. They split half the content databases onto one SQL server and half on the other. It is important to note that the underlying disk infrastructure was unchanged, meaning that total disk performance capability was kept constant even though there were now two database servers. This allowed Microsoft to isolate server capability from disk capability. Here is what happened:

4*1*1 4*1*2 6*1*2 8*1*2
RPS 560 660 890 930
Latency .2 .35 .2 .2
IOPS 910 1100 1350 1330
WFE CPU 76% 87% 78% 61%
SQL CPU 78% 33% 52% 58%

Here is what we can glean from these figures.

  1. Adding a second database server did not provide much additional RPS (560 to 660). This is because CPU utilization on the Web servers was high. In effect, the bottleneck shifted back to the web front end servers.
  2. With two database servers and eight web servers (8*1*2), the bottleneck became the disk infrastructure. (Note the IOPS at 6*1*2 is no better than 8*1*2).

So what can we conclude? From the figures shown above, it appears that you could reasonably expect (remember we are talking lead indicators here) that bottlenecks are likely to occur in the following order:

  1. Web Server CPU
  2. Database Server CPU
  3. Disk IOPS

It would be a stretch to suggest when each of these would happen because there are far too many variables to consider. But let’s now examine the second lab case study to see if this pattern is consistent.

Lab 2: Divisional Portal Environment

In this lab, Microsoft took a different approach from lab we just examined. This time they did not concern themselves with IOPS (“we did not consider disk I/O as a limiting factor. It is assumed that an infinite number of spindles are available”). The aim this time was to determine at what point a SQL Server CPU bottleneck was encountered. Based on what I have noted from the first lab test above, unless your disk infrastructure is particularly crap, SQL Server CPU should become a bottleneck before IOPS. However, one thing in common with the last lab test was that Microsoft factored in the effects of an ongoing search crawl by assuming 80% SQL Server CPU as the bottleneck indicator.

Much more detail was provided on the transaction breakdown for this lab. Page 181 and 182 lists transactions by type and and unlike the first lab, whether they are read or write. While it is hard to directly compare to lab 1, it appears that more traffic is oriented around document collaboration than in the first lab.

The basic methodology of this was to start off with a minimal farm configuration of a combined web/application server and one database server. Through multiple iterations, the test ended with a configuration of three Web servers, one application server, one database server.  The table of results are below:

1*1 1*1*1 2*1*1 3*1*1
RPS 101 150 318 310
Sustainable RPS 75 99 191 242
Latency .81 .85 .6 .8
Users simulated 125 150 200 226
WFE CPU 86% 36% 76% 42%
APP CPU NA 41% 46% 44%
SQL CPU 18% 32% 56% 75%

Here is what we can glean from these figures.

  1. Web Server CPU was the first bottleneck encountered.
  2. At a 3*1*1 configuration, SQL Server CPU became the bottleneck.  In lab 1 it was somewhere between the 3rd and 4th webserver.
  3. RPS, when CPU is taken into account, is fairly similar between each lab. For example, in the first lab, the 2*1*1 scenario RPS was 330. In this lab it was 318 and both had comparable CPU usage. The 1*1*1 test, had differing results (101 vs 180) , but if you adjust for the reported CPU usage, things even up.
  4. With each additional Web server, increase in RPS was almost linear. We can extrapolate that as long as SQL Server is not bottlenecked, you can add more Web servers and an additional increase in RPS is possible.
  5. Latencies are not affected much as we approached bottleneck on SQL Server. Once again, the disk subsystem was never stressed.
  6. The previous assertion that bottlenecks are likely to occur in the the order of Web Server CPU, Database Server CPU and then Disk subsystem appears to hold true.

Now we go any further, one important point that I have neglected to mention so far is that the figures above are extremely undesirable. Do you really want your web server and database server to be at 85% constantly? I think not. What you are seeing above are the upper limits, based on Microsoft’s testing. While this helps us understand maximum theoretical capacity, it does not make for a particularly scalable environment.

To account for the issue of reporting on max load, Microsoft defined what they termed as a “green zone” of performance. This is a term to describe what “normal” load conditions look like (for example, less than 50% CPU) and they also provided RPS results for when the servers were in that zone. If you look closely in the above tables you will see those RPS figures there as I labelled them as “Sustainable RPS”.

In case you are wondering, the % difference between sustainable RPS and peak RPS for each of the scenarios ranges between 60-75% of the peak RPS reported.

Some Microsoft math…

In the second lab, Microsoft offers some advice on how translate their results into our own deployments. They suggest determining a users to RPS ratio and then utilising the green zone RPS figures to estimate server requirements. This is best served via their own example from lab 2: They state the following:

  • A divisional portal in Microsoft which supports around 8000 employees collaborating heavily, experiences an average RPS of 110.
  • That gives a Users to RPS ratio of ~72 (that is, 8000/110). That is: 72 users will amount to 1 RPS.
  • Using this ratio and assuming the sustainable RPS figures from lab 2 results, Microsoft created the following table (page 196) to suggest the number of users a typical deployment might support.

image

A basic performance planning methodology…

Okay.. so I am done… I have no more topics that I want to cover (although I could go on forever on this stuff). Hopefully I have laid out enough conceptual scaffolding to allow you to read Microsoft’s large and complex documentation regarding SharePoint performance and capacity guide with more clarity than before. If I were to sum up a few of the key points of this 11 part exploration into the weird and wonderful world of SharePoint performance management it would be as follows:

  1. Part 1: Think of performance in terms of lead and lag indicators. You will have less of a brain fart when trying to make sense of everything.
  2. Part 2: Requests are often confused with transactions. A transaction (eg “save this document”) usually consists of multiple requests and the number of requests is not an indicator of performance. Look to RPS to help here…
  3. Part 3 and 4: The key to utilising RPS is to understand that as a counter on its own, it is not overly helpful. BUT it is the one metric that you probably have available in lots of detail, due to it being captured in web server logs over time. Use it to understand usage patterns of your sites and portals and determine peak usage and concurrent usage.
  4. Part 5: Latency (and disk latency in particular) is both unavoidable, yet one of the most common root causes of performance issues. Understanding it is critical.
  5. Part 6: Disk latency affects – and is affected by – IOPS, IO size and IO patterns. Focusing one one without the others is quite pointless. They all affect each other so watch out when they are specified in isolation (ie 5000 IOPS).
  6. Part 6, 7 and 8:  Latency and IOPS are handy in the  sense that they can be easily simulated and are therefore useful lead indicators. Test all SQL IO scenarios at 8k and 64K IO size and ensure it meets latency requirements.
  7. Part 9: Give your SAN dudes a specified IOPS, IO Size and latency target. Let them figure out the disk configuration that is needed to accommodate. If they can make your target then focus on other bottleneck areas.
  8. Part 10: Process Monitor and Windows Performance Analyser are brilliant tools for understanding disk IO patterns (among other things)
  9. Part 9 and 11: Don’t believe everything you read. Utilise Microsoft’s real world and lab results as a guide but validate expected behaviour by testing your own environment and look for gaps between what is expected and what you get.
  10. Part 11: In general, Web Server CPU will bottleneck first, followed by SQL Server CPU. If you followed the advice of points 6 and 7 above, then disk shouldn’t  be a problem.

Now I promised at the very start of this series, that I would provide you with a lightweight methodology for estimating SharePoint performance requirements. So assuming you have read this entire series and understand the implications, here goes nothing…

If they can meet this target, skip to step 8.  :-)

If they cannot meet this, don’t worry because there are two benefits gained already. First, by finding that they cannot get near the above figures, they will do some optimisation like test different stipe sizes and check some other common disk performance hiccups. This means they now better understand the disk performance patterns and are thinking in terms of lead indicators. The second benefit is that you can avoid tedious, detailed discussions on what RAID level to go with up front.

So while all of this is happening, do some more recon…

  • 4. Examine Microsoft and HP’s testing results that I covered in part 9 and in this article. Pay particular attention to the concurrent users and RPS figures. Also note the IOPS results from Microsoft and HP testing. To remind you, no test ever came in over 1400 IOPS.
  • 5. Use logparser to examine your own logs to understand usage patterns. Not only should you eke out metrics like max concurrent users and RPS figures, but examine peak times of the day, RPS growth rate over time, and what client applications or devices are being used to access your portal or intranet.
  • 6. Compare your peak and concurrent usage stats to Microsoft and HP’s findings. Are you half their size, double their size? This can give you some insight on a lower IOPS target to use. If you have 200 simultaneous users, then you can derive a target IOPS for your storage guys to meet that is more modest and in line with your own organisations size and make-up.

By now the storage guys will come back to you in shock because they cannot get near your 5000 IOPS requirement. Be warned though… they might ask you to sign a cheque to upgrade the storage subsystem to meet this requirement. It won’t be coming out of their budget for sure!

  • 7. Tell them to slowly reduce the IOPS until they hit the 8ms and 1ms latency targets and give them the revised target based on the calculation you made in step 6. If they still cannot make this, then sign the damn cheque!

At this point we have assumed that there is enough assurance that the disk infrastructure is sound. Now its all about CPU and memory.

  • 8. Determine a users to RPS ratio by dividing your total user base by average RPS (based on your findings from step 5).
  • 9.  Look at Microsoft’s published table (page 196 of the capacity planning guide and reproduced here just above this conclusion). See what it suggests for the minimum topology that should be needed for deployment.
  • 10. Use that as a baseline and now start to consider redundancy, load balancing and all of that other fun stuff.

And there you have it! My 10 step dodgy performance analysis method.  Smile

Conclusion and where to go next…

Right! Although I am done with this topic area, there are some next steps to consider.

Remember that this entire series is predicated on the notion that you are in the planning stage. Let’s say you have come up with a suggested topology and deployed the hardware and developed your SharePoint masterpiece. The job of ensuring performance will work to expectations does not stop here. You still should consider load testing to ensure that the deployed topology meets the expectations and validates the lead indicators. There is also a  seemingly endless number of optimisations that can be done within SharePoint too, such as caching to reduce SQL Server load or tuning web application or service application settings.

But for now, I hope that this series has met my stated goal of making this topic area that little bit more accessible and thankyou so much for taking the time to read it all.

 

Paul Culmsee

www.hereticsguidebooks.com

www.sevensigma.com.au

 Digg  Facebook  StumbleUpon  Technorati  Deli.cio.us  Slashdot  Twitter  Sphinn  Mixx  Google  DZone 

No Tags

Send to Kindle


Aug 08 2012

Demystifying SharePoint Performance Management Part 10 – More tools of the trade…

Send to Kindle

Hi all and welcome to the tenth article in my series on demystifying SharePoint performance management. I do feel that we are getting toward the home stretch here. If you go way back to Part 1, I stated my intent to highlight some common misconceptions and traps for younger players in the area of SharePoint performance management, while demonstrating a better way to think about measuring SharePoint performance (i.e. lead and lag indicators). While doing so, we examined the common performance indicators of RPS, IOPS, MBPS, latency and the tools and approaches to measuring and using them.

I started the series praising some of Microsoft’s material, namely the “Planning guide for server farms and environments for Microsoft SharePoint Server 2010.”, “Capacity Planning for Microsoft SharePoint Server 2010” and “Analysing Microsoft SharePoint Products and Technologies Usage” guides. But they are not perfect by any stretch, and in last post, I covered some of the inconsistencies and questionable information that does exist in the capacity planning guide in particular. Not only are some of the disk performance figures quoted given without any critical context needed to understand how to measure them in a meaningful way, some of the figures themselves are highly questionable.

I therefore concluded Part 9 by advising readers not to believe everything presented and always verify espoused reality with actual reality via testing and measurement.

Along the journey that we have undertaken, we have examined some of the tools that are available to perform such testing and measurement. So far, we have used Log Parser, SharePoint Flavored Weblog Reader, Windows Performance Monitor, SQLIO and the odd bit of PowerShell thrown in for good measure. This article will round things out by showing you two additional tools to verify theoretical fiction with hard cold reality. Both of these tools allow you to get a really good sense of IO patterns in particular (although they both have many other purposes). The first of which will be familiar to my more nerdy readers, but the second of which is highly powerful, but much lesser known to newbies and seasoned IT pros alike.

So without further adieu, lets meet our tools… Process Monitor and Windows Performance Analysis Toolkit.

Process Monitor

Our first tool is Process Monitor, also commonly known as Procmon. Now this tool is quite well known, so I will not be particularly verbose with my examination of it. But for the three of you who have never heard of this tool, Process Monitor allows us to (among many other things), monitor accesses to the file system by processes running on a server. This allows us to get a really low level view of IO requests as they happen in real time. What is really nice about Process Monitor is its granularity. It allows you to set up some sophisticated filtering that allows you to really see the wood from the trees. For example, one can create fairly elaborate filters that allow you to just see the details of a specific SQL database. Also handy is that all collected data can be saved to file for later examination too.

When you start Process Monitor, you will see a screen something like the one below. It will immediately start collecting data about various operations (there are around 140 monitorable operations that cover file system, registry, process, network and kernel stuff). When you launch Process Monitor it immediately starts monitoring file system, registry and processes. The default columns that are displayed include:

  • the name of the process performing the operation
  • the operation itself
  • the path to the object the operation was performed on
  • (and most importantly), a detail column, that tells you the good stuff.

image

The best way to learn Process Monitor is by example, so lets use it to collect SQL Server IO patterns on SharePoint databases when performing a full crawl in SharePoint (while excluding writes to transaction logs). It will be interesting to see the the range of IO request sizes during this time. To achieve this, we need to set up the filters for procmon to give us just what we need…

First up,  choose “Filter…” from the Filter menu.

image

In the top left column, choose “Process Name” from the list of columns. Leave the condition field as “is” and click on the drop down next to it. It will enumerate the running processes, allowing you to scroll down and find sqlservr.exe.

image   image

Click OK and your newly minted filter will be added to the list (check out the green include filter below). Now we will only see operations performed by SQL Server in the Process Monitor display.

image

Rather than give you a dose of screenshot hell, I will not individually show you how to add each filter as they are a similar process to what we just did to include only SQLSERVR.EXE. In all, we have to apply another 5 filters. The first two filter the operations down to reading and writing to the disk.

  • Filter on: Process Name
  • Condition: Is
  • Filter  applied: ReadFile
  • Filter type: Include
  • Filter on: Process Name
  • Condition: Is
  • Filter applied: WriteFile
  • Filter type: Include

Now we need to specify the database(s) that we are interested in. On my test server, SharePoint databases  are on a disk array mounted as D:\ drive. So I add the following filter:

  • Filter on: Path
  • Condition: Contains
  • Filter applied: D:\DATA\MSSQL
  • Filter type: Include

Finally, we want to exclude writes to translation logs. Since all transaction logs write to files with an .LDF extension, we can use an exclusion rule:

  • Filter on: Path
  • Condition: Contains
  • Filter applied: LDF
  • Filter type: Exclude

Okay, so we have our filters set. Now widen the detail column that I mentioned earlier. If you have captured some entries, you should see the word “Length :” with a number next to it. This is reporting the size of the IO request in bytes. Divide by 1024 if you want to get to kilobytes (KB). Below you can see a range of 1.5KB to 32KB.

image

At this point you are all set. Go to SharePoint central administration and find the search service application. Start a full crawl and fairly quickly, you should see matching disk IO operations displayed in Process Monitor. When the crawl is finished, you can choose to stop capturing and save the resulting capture to file. Process Monitor supports CSV format, which makes it easy to import into Excel as shown below. (In the example below I created a formula for the column called “IO Size”.

image

By the way, in my quick test analysis of disk IO of a window during during part of the during a full crawl, I captured 329 requests that were broken down as follows:

  • 142 IO requests (42% of total) were 8KB in size for a total of 1136KB
  • 48 IO requests (15% in total) were 16KB in size for a total of 768KB
  • 48 IO requests (15% in total) were >16KB to 32KB in size for a total of  1136KB
  • 49 IO requests (15% in total) were >32KB to 64KB in size for a total of 2552KB
  • 22 IO requests (7% in total) were >64KB to 128KB in size for a total of 2104KB
  • 20 IO requests (6% in total) were >128KB to 256KB in size for a total of 3904KB

Windows Performance Analyser (with a little help from Xperf123)

Allow me to introduce you to one of the best tools you never knew you needed. Windows Performance Analyser (WPA) is a newer addition to the armoury of tools for performance analysis and capacity planning. In short, it takes the idea of Windows Performance Monitor to a whole new level. WPA comes as part of a broader suite of tools collectively known as the Windows Performance Toolkit (WPT). Microsoft describes the toolkit as:

…designed for analysis of a wide range of performance problems including application start times, boot issues, deferred procedure calls and interrupt activity (DPCs and ISRs), system responsiveness issues, application resource usage, and interrupt storms.”

If most of those terms sounded scary to you, then it should be clear that WPA is a pretty serious tool and has utility for many things, going way beyond our narrow focus of Disk performance. But never fear BA’s, I am not going to take a deep dive approach to explaining this tool. Instead I am going to outline the quickest and simplest way to leverage WPA for examining disk IO patterns. In fact, you should be able to follow what I outline here on your own PC if SharePoint is not conveniently located nearby.

Now this gem of a tool is not available as a separate download. It actually comes installed as part of the Microsoft Windows SDK for Windows 7 and .NET Framework 4. Admins fearing bloat on their servers can rest easy though, as you can choose just to install the WPT components as shown below…

image_thumb6_thumb

By default, the windows performance toolkit will install its goodies into the C:\Program Files\Microsoft Windows Performance Toolkit” folder. So go ahead and install it now since it can be installed onto any version of Windows past Vista. (I am sure that none of you at all are reading this article on an Apple device right? :-).

Now assuming you have successfully installed WPT, I now want you to head on over to codeplex and download a little tool called Xperf123 and save it into the toolkit folder above. Xperf123 is a 3rd party tool that hides a lot of the complexity of WPA and thus is a useful starting point. The only thing to bear in mind is that Xperf123 is not part of WPA and is therefore not a necessity. If your inner tech geek wants to get to know the WPA commands better, then I highly recommend you read a highly comprehensive article written by Microsoft’s Robert Smith and published back in Feb 2012. The article is called “Analysing Storage Performance using the Windows Performance Analysis Toolkit” and it is an outstanding piece of work in this area.

So we are all set. Let’s run the same test as we did with Procmon earlier. I will start a trace on my test SharePoint server, run a full crawl and then look at the resulting IO patterns. Perform the following steps in sequence. (in my example I am using a test SharePoint server).

  1. Start Xperf123 from the WPT installation folder (run it as administrator).
  2. At the initial screen, click Next and then Next again at the screen displaying operating system details
  3. From the Select Trace Type dropdown, choose Disk  I/O and press Next
  4. Ensure that “Enable Perfmon”, “use Circular Logging” and optionally choose “Specify Output Directory”. Press Next
  5. Leave “Stackwalk” unticked and choose Next

image     image

image  image

AllrIghtie then… we are all set! Click the Start Capture Button to start collecting the good stuff! Xperf123 will run the actual WPA command line trace utility (called xperf.exe if you really want to know). Now go to SharePoint central administration and like what we did with our test of Process Monitor, start a full crawl. Wait till the crawl finishes and then in Xperf123, click the Stop Capture Button. A trace file will be saved in the WPT installation folder (or wherever you specify). The naming convention will be based on the server name and date the trace was run.

image  image

image

Okay, so capturing the trace was positively easy. How about analysing it visually?

Double click on the newly minted trace file and it will be loaded into the Performance Analyser analysis tool (This tool is also available from the Start menu as well). When the tool loads and processes the trace file, you will see CPU and Disk IO counts reported visually. The CPU is the line graph and IO counts are represented in a bar graph. Unlike Windows Performance Monitor which we covered in Part 7, this tool has a much better ability to drill down into details.

If you look closely below there are two “flyout” arrows. One is on the left side in the middle of the screen and applies to all graphs and the other is on the top-right of each graph. If you click them, you are able to apply filters to what information is displayed. For example: if you click the “IO Counts” flyout, you can filter out which type of IO you want to see. Looking at the screenshot below, you can see that the majority of what was captured were disk writes (the blue bars below).

image  image

Okay so lets get a closer look at what is going on disk-wise. Right click somewhere on the Disk IO bar graph and choose “Detail Graph” from the menu.

image

Now we have a very different view. On the left we can see which disk we are looking at and on the right we can see detailed performance stats for that disk. It may not be clear by the screenshot, but the disk IO reported below is broken down by process. This detailed view also has flyouts and dropdowns that allow you to filter what you see. There is an upper-left dropdown menu under the section called “Physical Disk”. This allows you to change what disk you are interested in. On the upper right, there is a flyout labelled “Process Name”. Clicking this allows you to filter the display to only view a subset of the process that were running at the time the trace was captured.

image

Now in my case, I only want to see the SQL Database activity, so I will make use of the aforementioned filtering capability. Below I show where I selected the disk where the database files reside and on the right I deselected all processes apart from SQLSERVR.EXE. Neat huh? Now we are looking at the graph of every individual IO operation performed during the time displayed and you can even hover over each dot to get more detail of the IO operation performed.

image

You can also zoom in with great granularity. Simply select a time period from the display using by dragging the mouse pointer over the graph area. Right click the now selected time period and choose “Zoom to Selection”. Cool eh? If your mouse has a wheel button, you can zoom in and out by pressing the Ctrl key and rolling the mouse wheel.

image

Now even for most wussy non technical BA reading this, surely your inner nerd is liking what you see. But why stop here? After all, Process Monitor gave us lots more loving detail and had the ability to utilise sophisticated filtering. So how does WPA stack up?

To answer this question, try these steps, Right click on the detail graph and this time choose “Summary Table”. This allows us to view even more detail of IO data.

image

image

Viola! We now have a list of every IO transaction performed during the sample period. Each line in the summary table represents a single I/O operation. The columns are movable and sortable as well. On that note, some of the more interesting ones for our purposes include (thanks Robert Smith for the explanation of these):

  • IO Type: Read, Write, or Flush
  • Complete Time: Time of I/O completion in milliseconds, relative to start and stop of the current trace.
  • IO Time: The amount of time in milliseconds the I/O took to complete
  • Disk Service Time: The inferred amount of time (in microseconds) the IO operation has spent on the device (this one has caveats, check Robert Smiths post for detail).
  • QD/I: Queue Depth the disk , irrespective of partitions, at the time this I/O request initialized
  • IO Size: Size of this I/O, in bytes.
  • Process Name: The name of the process that initiated this I/O.
  • Path: Path and file name, if known, that is the target of this I/O (in plain English, this essentially means the file name).

I have a lot of IO requests in this summary view, so let’s see how this baby can filter. First up, lets only look at IO that was initiated from SQL Server only. Right click on the “Process Name” column and choose “Filter To” –> “Search on Column…” In the resulting window, enter “SQLSERVR.EXE” in the “Find what:” textbox. Double check that the column name is set to “Process name” in the dropdown and click Filter.

image  image

You should now see only SQL IO traffic. So let’s drill down further still. This time I want to exclude IO transactions that are transaction log related. To do this, right click on the “Path Name” column and choose “Filter To” –> “Search on Column…” In the resulting window, enter “MDF” in the “Find what:” textbox. Double check that the column name is set to “Path name” in the dropdown and click Filter.

image image

Can you guess the effect? Only SQL Server database files will be displayed since they typically have a file extension of MDF.

In the screenshot below, I have then used the column sorting capability to look at the IO sizes. Neat huh?

image

Don’t forget Performance Monitor…

Just before we are done with Windows Performance Analysis Toolkit, cast your mind back to the start of this walkthrough when we used Xperf123 to generate this trace. If you check back, you might recall there was a tickbox in the Xperf123 wizard called “Enable Perfmon”. Well it turns out that Xperf123 had one final perk. While the WPA trace was made, a Perfmon trace was made of the system performance more broadly during the time. These logs are located in the C:\PerfLogs\ directory and are saved in the native Windows Performance Monitor format. So just double click the file and watch the love…

image

How’s that for a handy added bonus. It is also worth mentioning that the Perfmon trace captured has a significant number of performance counters in the categories of Memory, PhysicalDisk, Processor and System.

Conclusion and coming next…

Well! That was a long post, but that was more because of verbose screenshots than anything else.

Both Windows Performance Monitor and Windows Performance Analyser are very useful tools for developing a better understanding of disk IO patterns. While Procmon has more sophisticated filtering capabilities, WPA trumps Procmon in terms of reduced overhead (apparently 20,000 events per second is less than 2% CPU on a 2.0 GHz processor ). WPA also has the ability to visualise and drill down into the data better than Procmon can do.

Nevertheless, both tools have far more utility beyond the basic scenarios outlined in this series and are definitely worth investigating more.

In the next and I suspect final post, I will round off this examination of performance by making a few more general SharePoint performance recommendations and outlining a lightweight methodology that you can use for your own assessments.

Until then, thanks for reading…

Paul Culmsee

www.hereticsguidebooks.com

 Digg  Facebook  StumbleUpon  Technorati  Deli.cio.us  Slashdot  Twitter  Sphinn  Mixx  Google  DZone 

No Tags

Send to Kindle


Jul 30 2012

Demystifying SharePoint Performance Management Part 9 – Don’t believe everything you R/W

Send to Kindle

Hi and welcome to Part 9 (bloody hell… nine!) of my series on trying to demystify SharePoint performance management a bit. If by any chance you have been asked to provide some sizing information for your organisation and you are finding the resources online a bit overwhelming, this series is for you. If you have been a part of our varied journey so far, the last few posts have been all about Disk IO performance in the form of latency, IOPS and MBPS. In the last two articles, we have been learning about the different IO patterns that SQL Server is likely to utilise, as well as using the jackhammer known as the SQLIO utility, that is used to simulate those IO patterns on unsuspecting disk infrastructure.

Now just to set the scene for this post (and conveniently perform some product placement), I recently published a book called “The Heretics Guide to Best Practices”. Now being the author and all, I am going to suggest you buy it because it is a completely riveting read! :-).

Now apart from blatant product placement, the real reason I mention it is because one of the chapters is called “Myths, Memes and Methodologies”. In it, we examine why some ideas gain legitimacy, even though they are based on often completely dodgy foundations. I mention this here, because in terms of SQL disk IO sizing, something similar has happened with Microsoft’s published material on the topic. So the focus of this article is to finish off our discussion on understanding disk IO patterns, while lifting the lid on some of the inconsistencies in the material that that end up being repeated by SharePoint consultants as gospel to their unsuspecting disciples.

Now harking way back to part 1 to the notion of lead vs. lag indicators, our use of SQLIO thus far has essentially been used as a lead indicator. While SQLIO puts a real load on disk infrastructure and faithfully reports the resulting IOPS, latency and MBPS, the reality is it can never truly capture the nuances of a production SharePoint farm doing its thing. But in terms of a lead indicator that is okay. After all, a lead indicator by definition cannot guarantee an outcome. It can merely suggest that an outcome should be able to be met.

So while we are thinking about the lead indicator world view, some of you might have noticed that I have not yet made any suggestions what are the minimum conditions of satisfaction for disk infrastructure used to underpin SharePoint. This has been deliberate until now, because I felt that it was critical to understand the relationship between the size of a disk IO operation, and its effect on IOPS, latency and MBPS first. To that end, hopefully I have instilled a reflex in you where – if you are given an arbitrary latency, IOPS or MBPS figure that you have to meet – you immediately ask questions like, “What sort of IO patterns?” or “how large is the IO request typically going to be?” or “is the IO random or sequential?”

When whitepapers mislead…

Now we are about to get into one area where Microsoft’s published documentation is quite weak. Remember the 367 page “Capacity Planning for Microsoft SharePoint Server 2010” whitepaper? Starting at page 326, there is a section with the promising title of “Estimate Core Storage and IOPS needs” (this topic is also available separately as a technet article too). The problem is in despite that title, very little IOPS guidance actually is given. Instead the content in the section overwhelmingly speaks about estimating storage requirements. In fact the best you get is one explicit mention of IOPS in relation to the SharePoint Search service application which states the following:

The IOPS requirements for Search are significant.

  • For the Crawl database, search requires from 3,500 to 7,000 IOPS.
  • For the Property database, search requires 2,000 IOPS.

Note: For the purpose of the rest of this article, lets add the above figures together and simply say between 5,500 to 9000 IOPS for search.

Do you see the problem here? This is simply an arbitrary IOPS figure with no guidance as to the IO patterns underpinning it. What about latency or the IO request size that you need to assume? Unfortunately, no guidance is given for these questions which makes this quoted figure not overly helpful. Plus, as you will soon see below, Microsoft seemingly contradict themselves elsewhere in the same whitepaper…

So what are good numbers to use?

In the absence of any hard data, the best way to deal with storage requirements is to think in terms of lead indicators. Indicators from a lead point of view, can be framed as targets – something to aim for. Targets then can be broken down into different categories ranging from “cover your arse” to “above and beyond”:

  • Aspired target: The “this would be bloody fantastic if we could get there” target.
  • Agreed target: The “this is what we are going to deliver no matter what” target.
  • Minimum Condition of Satisfaction (MCOS) target: The “If we don’t achieve this we may as well pack up and go home” target.

So given these sorts of targets, what should the disk IO performance targets for SharePoint be? To work this out, we can utilise information already out there. Well…that is, we could if the information out there wasn’t so disparate and disconnected. So unfortunately, it takes some digging to you can find what you need.

Our first point of call in this regard is indeed Microsoft and the very same capacity planning and configuration guide that I criticised earlier for poorly dealing with IOPS. Hidden in the bowls of that document, the following statement is made on page 334 (emphasis mine):

Any storage architecture must support your availability needs and perform adequately in IOPS and latency. To be supported, the system must consistently return the first byte of data within 20 milliseconds.

So the way I look at it, a 20ms latency should be our MCOS target (see the explanation above for MCOS). If we consistently do worse than this, then we do not have a lot of assurance about the scalability of the disk IO subsystem being used for SharePoint. But like the arbitrary IOPS figure quoted in the previous section, I wonder if readers have spotted the problem with specifying this latency figure alone?

In both cases, don’t forget the almost symbiotic type of relationships between IO size, IOPS and latency. If we assumed that all IO operations were small (for example SQL’s page size of 8KB) then we could likely stay way under the 20ms limit with a more modest disk infrastructure. But to sustain the same latency with a larger IO size would require a faster disk subsystem. Why? Well as we discussed in part 6, if the size of the IO writes are larger, such as 64KB, then latency will go up because servicing larger requests takes longer than smaller ones. Therefore, if we were to assume a larger IO size, we would need more/faster disks to be able to meet the same 20ms latency KPI.

So what disk IO size should we assume to give context to a latency figure? Some insight can be found back in part 6, when we examined SQL IO characteristics and established that it will likely be much more varied than SQL’s base IO unit of 8KB pages. My suggestion therefore, is to test 8KB but also ensure that 64KB can meet the latency target. This is because 64KB represents a reasonable average size between the 8KB to 256KB range most SQL Server’s IO operations will fall within. Thus, if a SQLIO test using random read/writes at 64K indicates more than 20ms latency consistently, then you should probably ask your storage people to take another look at it.

By the way, if you really want to give your storage guys a challenge, keep jacking up the IO size!

What about aspired latency targets?

So if you are cool with the notion that the minimum condition of satisfaction for a random IO test using 64K size should be less than 20ms latency, what about aiming higher with agreed or aspirational targets?

Luckily for all of us, we can once again stand on the shoulders of giants. In this case, the Bob Duffy indirectly answers this question by providing what he considers to be the indicators for optimal SQL Server performance in general. In an excellent article with the rather appropriate title of “How to Specify SQL Storage Requirements to your SAN Dude” Bob makes the following recommendations:

  • SQL Data files must have a response time averaging about 8ms and a maximum response time of around 20ms using 64k size IOs and that are random in nature
  • SQL Log Files must have a write response time averaging from 1-5ms. use 64k IO size and are sequential in nature

The nice thing about specifying a target or benchmark like this, is that you are able to sidestep discussions on RAID levels, stripe sizes and many other things that SAN nerds find interesting. We keep things focused on the lead indicators and in effect state “If you can meet these figures, configure it any way you like.” This gives the SAN guys the freedom to do their job, while giving you an indicator that can give you confidence in the disk infrastructure. So if we were to distil the figures above into lead indicator targets for storage gurus, it might look something like this:

  • MCOS target: Less than 20ms latency for random IO requests of 64KB
  • Agreed target: Average 8ms latency for random IO requests of 64KB with no more than 20ms max latency. Less than 5ms latency for sequential log IO
  • Aspired target: No more than 8ms latency for random SQL IO requests of 64KB and average of 1ms latency sequential log IO with max never going above 5ms

Now in the ProData article, Bob made a slightly tongue in cheek point that sums up the above thinking really well, as well as giving insight to a critical aspect we have not considered so far…

Nowadays most SQL consultants try and not talk about RAID types and types of disk, it can be best to leave that up to the storage guys. If the storage team can meet my requirement for 5,000 random 64k read/write IOPs at 8ms latency by using 50 old SATA drives at 5,400 rpm in RAID 5 then knock yourself out – I’m happy. Well maybe I’m happy till we have that chat about Service Level requirements during a disk degrade event but that’s a different story…

If you look closely at Bob’s quote, you will see that he has also specified the last critical variable in the mix. Bob’s mention of “5000 random 64k read/write IOPS” is in reference to another point he makes. Without an IOPS figure to work from, the targets we have come up with are effectively meaningless. Quoting Bob:

The main thing to specify apart form your latency requirement is the throughput (IOPs). It is no good meeting the 8ms target for 100 IOPs and then finding your workloads needs 5,000 IOPs. You wont be able to meet the 8ms target!!

Consider it this way… a SharePoint site that services 100,000 users, will process a lot more IO requests than a site that services 10 users. With the latter, it is quite likely that the latency targets we have been talking about (even the aspirational ones) would be pretty easy to meet with a single disk. (To hark back to our shopping centre metaphor, one check out operator is all that is needed at a corner store, whereas many are needed at the supermarket). This is presumably why Bob has used a figure like 5000 IOPS for his post. It is probably a figure that conveniently represents some fairly heavy disk usage. But it does beg two question:

  • How much IOPS should we use to simulate SharePoint IOPS?
  • In the absence of anything else, perhaps 5000 IOPS is a good figure to go with?

Don’t believe all you read…

Now if you go back and read the start of this post, you will recall I mentioned that Microsoft stated some IOPS figures for the SharePoint search application databases ranged between 5,500 to 9000. That would indicate that Bob’s base figure of 5000 is a bit low, especially given that SharePoint has many other components beyond search that have not been taken into account. So to put Bob’s 5000 IOPS figure in perspective, let’s re-examine Microsoft’s trusty capacity planning whitepaper. One of the great things about this document is that Microsoft detailed the performance stats of a typical day in the life of their internal SharePoint environments. Since Microsoft are so large, they have different SharePoint farms for different collaborative scenarios. The scenarios they covered were:

  1. Enterprise Intranet environment (also described as published intranet). In this scenario, employees view content like news, technical articles, employee profiles, documentation, and training resources. It is also the place where all search queries are performed for all of the other the SharePoint environments within the company.
  2. Enterprise intranet collaboration environment (also described as intranet collaboration). In this scenario, is where important team sites and publishing portals are housed. They are typically used for enterprise collaboration, organizations, teams, and projects. Sites created in this environment are used as communication portals, applications for business solutions, and general collaboration. No custom code runs in these sites.
  3. Departmental Collaboration environment. In this scenario, employees use this environment to track projects, collaborate on documents, and share information within their department.
  4. Social Collaboration Environment. This is the My Sites scenario. These connect employees with one another and the information that they need. Employees use this environment to present personal information such as areas of expertise, past projects, and colleagues to the wider organization. The environment also hosts personal sites and documents for viewing, editing, and collaboration.

Now reading about these scenarios is highly interesting and Microsoft provides some nice nuggets of information that we will use in a future post. But for now I will stick purely to a disk IOPS perspective. To that end, below are a few fun-filled facts about the number of users in each of the four scenarios:

  1. Enterprise Intranet environment:  33580 unique users per day, with an average of 172 concurrent and a peak concurrency of 376 users.
  2. Enterprise intranet collaboration environment: 69702 unique users per day, with an average of 420 concurrent users and a peak concurrency of 1433 users
  3. Departmental Collaboration environment. 9186 unique users per day, with an average of 189 concurrent users and a peak concurrency of 322 users
  4. Social Collaboration Environment. 69814 unique users per day, with an average of 639 concurrent users and a peak concurrency of 1186 users

So now you have a sense of the size of these scenarios and as an added bonus, gotten a glimpse into the difference that usage patterns can make. For example: social collaboration and enterprise collaboration have similar number of unique users but social has more average concurrency but less peak. But what about IOPS?

In the document, IOPS is split into reads per second and writes per second, so I added them to estimate IOPS. The results are rather surprising…

Metric

Social Collaboration

Departmental Collaboration

Published intranet

Intranet Collaboration

Unique visitors

69814

9186

33580

69702

Average concurrent

639

189

172

420

Max concurrent

1186

322

376

1433

IOPS

941

74

409.66

409.66

Now while it might be tempting to ponder why social collaboration has over double the IOPS, yet half the concurrency of enterprise intranet collaboration, we are not going to worry about here. Besides, we actually covered some of it already when we used logparser to get insights of usage patterns. What I will instead do is draw your attention to is the fact that that none of the IOPS scenarios come anywhere near the 5000 IOPS figures cited by ProData or Microsoft’s 5500-9000 IOPS cited for search (in the very same capacity planning document I might add!)

So something is amiss. If an organisation the size of Microsoft can have almost 70000 unique users per day, with a peak concurrency of 1433 users and only total 410 IOPS, then where the hell did the 5500-9000 IOPS figure for search alone come from? Even if you take the scenario with the highest IOPS (the Social collaboration scenario with 941 IOPS), that’s still less than one fifth 5500 IOPS which was at the low end of the search IOPS figure.

Now I am also suspicious that two different case studies have the exact same IOPS figure. If you compare the “published intranet” scenario with the “intranet collaboration” scenario, one has half the visitors, yet both have precisely the same IOPS (right down to decimal places). That seems highly unlikely to me and I suggest that a mistake has been made. Given the intranet collaboration has the highest max concurrency figure, I would have expected IOPS to be a higher than it is. Hmmm…

What can we take away from this? For one, the capacity planning document could seriously do with a rewrite in this area. Secondly, I don’t have a lot of faith in those IOPS figures quoted (although I have more confidence in the case studies that the arbitrary figures specified for search).

So if we put aside the doubt created by the issues with the capacity planning guide, there is one really interesting fact that remains… none of the reported IOPS figures came anywhere near 5000 IOPS.

Insights from HP…

It turns out that Hewlett Packard also did some load testing of SharePoint 2010 (among other things) and published a whitepaper called the “HP performance and configuration guide for Microsoft SharePoint 2010“.  In this guide, they detail the results of a scenario they tested based on what they termed an “Enterprise Workload”. The guide covers definition of enterprise workload in loving detail, but the gist of it is that it covers the following areas:

  • Document Center (30% of operations) Check-out, download, upload and check-in documents
  • Team Sites – (20% of operations) work with calendars, discussions and documents
  • Portal SItes – (20% of operations) work with event, announcements and surveys
  • My Sites – (10% of operations) work with documents in personal documents library
  • Search – (20% of operations) Submit searches with random word or phrases

HP then simulates 500 concurrent users performing the actions above. In Table 13 of the report (page 28 of their document and reproduced below) , HP outline the performance and even break down the IO characteristics of each SharePoint database (which is really handy indeed). Adding up the last column of transfers/sec (which is essentially IOPS) we get a result of 1347.33 IOPS.

Thus we are still considerably under the 5000 IOPS that Bob Duffy suggests.

Conclusion…

Right! Remember our discussion above on MCOS, agreed and aspired targets? For an aspirational target, I think that we can reasonably use 5000 IOPS as a starting point for an enterprise organisation of Microsoft’s size. If we stick with 5000 IOPS, then my suggestion for an aspirational latency target would be:

  1. no more than 8ms latency for random SQL IO requests of 64KB
  2. average 1ms (and no more than 5ms max) latency of sequential log IO of 64KB

I think these figures are a pretty good test of a disk subsystem and think that Bob at ProData is therefore pretty close to the mark. Of course, you can use these figures to make your own judgement and adjust accordingly. Provided that you think of them as lead indicators that provide you a level of confidence in your disk infrastructure, you now have the tools and knowhow to run the tests too.

So if there was a moral of the story to this post, it would be to not believe everything you read and always verify espoused reality with actual reality via testing. On that note, the next post will finish off our examination of disk performance by going over 2 additional tools that I think are particularly good for testing assumptions. After that, we will be revisiting Microsoft’s case studies, as well as some findings, insights and recommendations from some additional lab scenarios that Microsoft conducted.

Thanks for reading

Paul Culmsee

www.sevensigma.com.au

 Digg  Facebook  StumbleUpon  Technorati  Deli.cio.us  Slashdot  Twitter  Sphinn  Mixx  Google  DZone 

No Tags

Send to Kindle


Jul 23 2012

Demystifying SharePoint Performance Management Part 8 – More on SQL and SQLIO

Send to Kindle

Well here we are at part 8 of my series on making SharePoint performance management that little bit easier to understand. What is interesting about this series is its timing. If by some minute chance that the marketing tsunami has passed you by at the time I write this, SharePoint 2013 public beta was released. Much is being made about its stated requirement of 24GB of RAM for a “Single server with a built-in database or single server that uses SQL Server”. While the reality is that requirements depends on what components that you are working with, this series of articles should be just as useful in relation to SharePoint 2013 as for any other version.

Now, if you have been following events thus far, we have been spending some time examining disk performance, as that is a very common area where a sub optimal configuration can result in a poor experience for users. In part 6, we looked at the relationship between the performance metrics of disk latency, IOPS and MBPS. We also touched on the IO characteristics (nerd speak for the manner in which something reads and writes to disk) of SQL Server and some SharePoint components. In the last post, we examined the windows performance counters that one would use to quickly monitor latency and IOPS in particular. We then finished off by taking a toe dip into the coolness of the SQLIO utility, that is a great tool for stress testing your storage infrastructure by pulverising it with different IO read and write patterns.

In this post, we will spend a bit of time taking SQLIO to the next step and I will show you how you can run a comprehensive disk infrastructure stress test. Luckily for the both of us, others have done the hard work for us and we can reap the benefits of their expertise and insights. First up however, I would like to kick things off by spending a little time showing you the relationship between SQLIO results and performance monitor counters. This helps to reinforce what the reported numbers mean.

Performance Monitor and SQLIO

In the previous post when we used Windows Performance Monitor, we plotted IOPS and Latency by watching the counters as they occurred in real-time. While this is nice for a quick analysis, nothing is actually stored for later analysis. Fortunately, performance monitor has the capability to run a trace and collect a much larger data set for a more detailed analysis later. So first up, lets use performance monitor to collect disk performance data while we run a SQLIO stress test. After the test has been run, we will then review the trace data and validate it against the results that SQLIO reports.

So go ahead and start up performance monitor (and consult part 7 of this series if you are unsure of how to do this). Looking at the top left of the performance manager, you should see several options listed under “Performance”. Click on “Data Collector Sets” and look for a sub menu called “User Defined”. Now right click on “User Defined” and choose “New –> Data Collector Set” as shown below:

image

This will start a wizard that will ask us to define what performance counters we are interested in and how often to sample performance. I have pasted screenshots of the sequence below (click to enlarge any particular one). First up we need to give a name to this collection of counters and as you can see below, I called mine “Disk IO Experiments”. Once we have given it a name, we have to choose the type of performance data we want to collect. Tick the “Performance counter” button and ensure the others are left unticked.

image  image

Next we need to pick what specific counters we need. We will use the same counters that we used in part 7, with the addition of two additional ones. To remind you of part 7, the counters we looked at were:

  • Avg. Disk sec/Read   – (measures latency by looking at how long in seconds, a read of data from the disk took)
  • Avg. Disk sec/Write  –  (measures latency by looking at how long in seconds, a write of data to the disk took)
  • Disk Reads/sec  –  (measures IOPS by looking at the rate of read operations on the disk per second)
  • Disk Writes/sec  – (measures IOPS by looking at the rate of write operations on the disk per second)

In addition to these counters, we will also add two more to the collector set

  • Avg. Disk Bytes/Read – (Measures size of each read request by reporting the number of bytes each used)
  • Avg. Disk Bytes/Write – (Measures size of each write request by reporting the number of bytes each)

We will use these counters to see if the size of the IO request than SQLIO uses is reported correctly.

Depending on your configuration, choose the PhysicalDisk or LogicalDisk  performance object (consult part 7 for the difference between PhysicalDisk and LogicalDisk). You will then find the performance counters I listed above. Before you do anything else, make sure that you pick the right disk or partition from the “Instances of selected object” section. We need to specifically pick the disk or partition that SQLIO is going to stress test. Now you select each of the aforementioned six performance counters and click the “Add” button. Finally, make sure that you pick the sample interval to be 1 second as shown below. This is really important because it makes it easy to compare to SQLIO which reports on a per second basis.

image  image

At this point you do not need to configure anything else, so click the “Finish” button, rather than the “next” button, and the collector should now be ready to go. It will not start by default, but since there is no fun in that, let’s collect some data. Right click on your shiny new data collector set and choose “Start”.

image  image

Once started, performance monitor is collecting the values of the six counters each and every second. Now let’s run a SQLIO command to give it something to measure. In this example, I am going to run SQLIO with random 8KB writes. But to make it interesting, I will use two threads and simulate 8 outstanding IO requests in the queue. If you recall by grocery check-out metaphor of part 6, this is like having 8 people with full shopping carts waiting in line for a single check-out operator. Since the guy at the back of the line has to wait for the seven people in front of him to be processed, he has to wait longer. So with eight outstanding IO requests, latency should increase as each IO request will be sitting in a queue behind the seven other requests.

By the way, if none of that made sense, then you did not read part 6 and part 7. I urge you to read them before continuing here, because I am assuming prior knowledge of SQLIO and disk latency characteristics and the big trolley theory..

Here is the SQLIO command and below is the result…

SQLIO –kW –b8 –frandom –s120 –t2 –o8 –BH –LS F:\testfile.dat

image

Now take a note of the results reported. IOPS was 526, MBs/sec was 4.11 and as expected, the average latency was much larger than the SQLIO tests we ran in part 7. In this case, latency was 29 milliseconds.

Let’s now compare this to what performance monitor captured. First up, return to Performance Monitor, and stop your data collector set by right clicking on it and choosing “Stop”. Now if you cast your eye to the top left navigation pane, you should see an option called “Reports” listed under “Performance”. Click on “Reports” and look for a sub menu called “User Defined”. Expand “User Defined” and hey presto! Your data collector set should be listed…

image  image

Expand the data collector set and you will find a report for the data you just collected. The naming convention is the server name and the date of the collection. Click on this and you will then see the performance data for that collection in the right pane. At the bottom you can see the six performance counters we chose and just by looking at the graph, you can clearly see when SQLIO started and stopped.

image

Now we have to do one additional step to make sure that we are comparing apples with apples. Performance monitor will calculate its averages based on the total time displayed. As you can see above, I did not run SQLIO straight away, but the performance counters were collected each second nonetheless. Therefore we have a heap of zero values that will bring the averages down and mislead. Fear not though, it is fairly easy (although not completely obvious) to zoom into the time we are interested in. If you look closely, just below the performance graph, where the time is reported, there is a sliding scale. If you click and drag the left and right boxes, you can highlight a specific time you are interested in. This will be shown in the performance graph too, so using this tool, we can get more specific about the time we are interested in. Then in the toolbar above the graph, you will see a zoom button. Click it and watch the magic…

image  image

As you can see below, now we are looking at the performance data for the period when the SQLIO was run. (Now it should be noted that windows performance monitor isn’t particularly granular here. I had to fiddle with the sliding scale a couple of times to accurately set the exact times when SQLIO was started and then stopped.)

image

Now let’s look at the results reported by performance monitor. The screenshot above is looking at the number of Disk Writes per second. Let’s zoom into the figures for the time period and example the average result over the sample period. To save you squinting, I have pasted it below and called out the counter in question. Performance monitor has reported average “Disk Writes/Sec” as 525.423. This is entirely consistent with SQLIO’s reported IOPS of 526.

image

Latency (reported in seconds via the counter Avg. Disk sec/Write) is also fairly consistent with SQLIO. The figure from performance monitor was 0.03 seconds (30 milliseconds). SQLIO reported 29 milliseconds.

image

What about IO size? Well, that’s what Avg disk bytes/write is for… Let’s take a look shall we? Yup.. 8192 kilobytes, which is exactly the parameters specified.

image

SQL IO characteristics revisited (and an awesome script)

Now that we understand what SQLIO is telling us via examining windows performance monitor counters, I’d like to return to the topic of SQL IO patterns. Back at the end of part 6, I spent some time talking about SQL and SharePoint IO characteristics. As a quick recap, I mentioned SQL reads and writes to databases via 8KB pages. Now based on me telling you that, you might assume that if you had to open a large document from SharePoint (say 1MB  or 1024KB), SQL would make 128 IO requests of 8KB each.

While that would be a reasonable assumption, its also wrong. You see, I also mentioned that SQL Server also has a read-ahead algorithm. This algorithm means that means SQL will try and proactively retrieve data pages that are going to be used in the immediate future. As a result, even though a single page is only 8KB, it is not unusual to see SQL read data from disk in a much wider range if it thinks the next few 8KB pages are likely to be asked for anyway. Now as an aside, if you are running SQL Enterprise edition, the possible read-ahead range is from 1 to 128 pages (other editions of SQL max out at 32 pages). Assuming SQL Enterprise edition, this translates to between 8KB and 1024KB for a single IO operation. Think about this for a second… based on the 1MB document example I used in the previous paragraph, it is technically possible that this could be serviced with a single IO request by an enterprise edition of SQL server.

Okay, so essentially SQL has varying IO characteristics when it comes to reading from and writing to databases. But there is still more to it. This is because there are a myriad of SQL IO operations that we did not even consider in part 6. As an example, we have not spoken about the IO characteristics of how SQL writes to transaction logs (which is sequential as opposed to random IO, and does not use 8k pages at all). Another little known fact with transaction logs is that SQL has to wait for them to be “flushed to stable media” before the data page itself can be flushed to stable media. This is known as Write Ahead Logging and is used for data integrity purposes. What is means though is that if logging has a lot of latency, the rest of SQL server can potentially suffer as well (and if it was not obvious before, yet another good reason why people recommend putting SQL data and log files on different disks).

Now I am not going to delve deep into SQL IO patterns any more than this, because we are now getting into serious nerdy territory. However what I will say is this: by understanding the characteristics of these IO patterns, we have the opportunity to change the parameters we pass to SQLIO and more accurately reflect real-world SQL characteristics in our testing. Luckily for all of us, others have already done the hard work in this area. First up, Bob Duffy created a table that summarises SQL Server IO patterns based on the type of operations being performed. Even better than that… Niels Grove-Rasmussen wrote a completely brilliant post, where not only did he list the IO patterns that SQL is likely to exhibit, he wrote a PowerShell script that then runs 5 minute SQLIO simulations for each and every one of them!

I have not pasted the script here, but you will find it at Niels article. What I will say though is that aside from the obvious 8KB random reads and writes that we have concentrated on thus far, Niels listed several other common SQL IO patterns that his SQLIO script tests:

  • 1 KB sequential writes to the log file (small log writes)
  • 64 KB sequential writes to the log file (bulk log writes)
  • 8 KB random reads to the log file (rollbacks)
  • 64 KB sequential writes to the data files (checkpoints, reindex, bulk inserts)
  • 64 KB sequential reads to the data files (read-ahead, reindex, checkdb)
  • 128 KB sequential reads to the data files (read-ahead, reindex, checkdb)
  • 128 KB sequential writes to the data files (bulk inserts, reindex)
  • 256 KB sequential reads to the data files (read-ahead, reindex)
  • 1024 KB sequential reads to the data files (enterprise edition read-ahead)
  • 1 MB sequential reads to the data files (backups)

The script actually handles more combinations than those listed above because it also tests for differing number of threads (-t ) and outstanding requests (-o ). All in all, over 570 combinations of IO patterns are tested. Be warned here… given that each test takes 5 minutes to run by default, with a 60 second wait time in between each test, be prepared to give this script at least 2 days to let it run its course!

The script itself is dead simple to run. Just open a powershell window, and save Niels script to the SQLIO installation folder. From there, change to that directory and issue the command:

./SQLIO_Batch.ps1

Then come back in 3 days! Seriously though, depending on your requirements, you can modify the parameters of the script to reduce the number of scenarios based on editing the first 7 lines of code which is quite self explanatory

$Drive = @('G', 'H', 'I', 'J')
$IO_Kind = @('W', 'R') # Write before read so that there is something to read.
$Threads = @(2, 4, 8)
#$Threads = @(2, 4, 8, 16, 32, 64)
$Seconds = 10*60 # Five minutes
$Factor = @('random', 'sequential')
$Outstanding = @(1, 2, 4, 8, 16, 32, 64, 128)
$BlockSize = @(1, 8, 64, 128, 256, 1024)

Now if this wasn’t cool enough, Niels also written a second script that parses the output from all of the SQLIO tests. This can produce a CSV file that allows you to perform further analysis in excel. To run this script, we need to know the same of the output file of the first script. By default the filename is SQLIO_Result.<date>.txt. For example:

./SQLIO-Parse.ps1 -ResultFileName ‘SQLIO_Result.2010-12-24.txt’

By default the parse script outputs to the screen, but modifying it to write to CSV file is really easy. All one has to do is comment out the second last line of code and uncomment the last one as shown below:

#$Sqlio | Format-Table -Property Kind,Threads,Seconds,Drive,Stripe,Outstanding,Size,IOs,MBs,Latency_min,Latency_avg,Latency_max -AutoSize

$Sqlio | Export-Csv SQLIO_Parse.csv

Below is an example of the report in Excel. Neat eh?

image

Conclusion and coming up next…

By now, you should be a SQLIO guru and have a much better idea of the sort of IO patterns that SQL Server has beyond just reading from and writing to databases. We have covered the IO patterns of transaction logs, as well as examined a terrific PowerShell script that not only runs all of the IO scenarios that you need to, but parses the output to produce a CSV file for deeper analysis. In short, you now have the tools you need to run a pretty good disk infrastructure stress test and start some interesting conversations with your storage gurus.

However at this point I feel there some pieces missing to this disk puzzle:

  1. We have not yet brought the discussion back to lead and lag indicators. So while we know how to hammer disk infrastructure, how can we be more proactive and specify minimum conditions of satisfaction for our disk infrastructure?
  2. Microsoft treatment of disk performance (and in particular IOPS and latency) in their performance documentation is inconsistent and in my opinion, confuses more than it clarifies. So in the next post, we are going to look at these two issues. In doing so, we are going to leave SQLIO and Performance Monitor behind and examine two other utilities including one that is lesser known, but highly powerful.

Until then, thanks for reading

Paul Culmsee

www.hereticsguidebooks.com

 Digg  Facebook  StumbleUpon  Technorati  Deli.cio.us  Slashdot  Twitter  Sphinn  Mixx  Google  DZone 

No Tags

Send to Kindle


« Previous PageNext Page »

Today is: Sunday 21 December 2014 |