 

Sep 02 2013

Rethinking SharePoint Maturity Part 5: From Conditions to Actionable Lessons Learnt


Hi all

Welcome to part five of my quest to improve people’s awareness and understanding of what SharePoint maturity is really all about. For those new to this series of articles, we have traversed a bit of territory to get here, and during the journey there has been not a single SharePoint site column, content type or site collection in sight. In fact, I have not touched any of the topics that many would traditionally view as a sign of SharePoint maturity. Instead, I have been taking readers on a fun-filled journey examining three nerdy, yet highly interesting areas of research in team development, collaboration and organisational learning. Along the way we defaced the Mona Lisa, looked at SharePoint through holes in slices of Swiss Cheese and channelled the number of the beast.

After all that, we ended part 4 by arriving at the odd-looking diagram below…

What you are looking at is something called the CALL model, which stands for “Conditions to Actionable Lessons Learnt”. I originally developed the model with Dialogue Mapping and knowledge management in mind – essentially to help my clients do a better job of integrating double loop learning into their projects. However, it soon became apparent that it was valuable in various SharePoint contexts too.

Single vs. double loop learning

In the previous paragraph I made a reference to “double loop learning” without explaining what it was, so let’s quickly make amends because it is interesting stuff. The idea of single and double loop learning has been around for close to 40 years – Chris Argyris first proposed it in 1974. To explain, let’s bring back our trusty SharePoint 2010 governance poster that I trash-talked in the first and fourth articles…

If you have not seen this poster before, it represents what Microsoft believe to be the focus areas for SharePoint 2010 governance. Many people – consultants in particular – will take the information in this poster for granted and create SharePoint governance plans that try to cover off the various areas it lists. Everyone will feel good because they have ticked all the boxes of this authoritative fountain of SharePoint wisdom.

Then, if SharePoint fails to live up to expectations, many will look at the poster and wonder which areas they did not adequately cover. They will study the poster, search Google or Wikipedia for better definitions of the terms listed, and then try again, attempting to do an even better job of implementing the wisdom contained therein.

This, my friends, is a shining example of single loop learning. Single loop learning, as described in this article, “seems to be present when goals, values, frameworks and, to a significant extent, strategies are taken for granted. The emphasis is on techniques and making techniques more efficient.” In single loop learning, the fundamental premise of a course of action remains unchanged. All of the energy of learning is directed to making sure “we get it right this time.” In short, in a single loop learning scenario, repeated attempts are made at solving the same issue, but no-one questions the underlying premise of the strategy.

Now in case you haven’t noticed, I spent the first three posts of this series “questioning the underlying premise” of the above SharePoint governance poster. So in effect I’ve been introducing you to the notion of double loop learning already. Double loop learning involves taking a deeper look at what is going on. In double loop learning, having attempted to achieve a goal on different occasions, the goal itself may be modified, re-framed or rejected in the light of the experience gained in trying to achieve it. Think about it – if double loop learning actually happened in organisations, people would never say things like “well, that’s always how we have done it here.”

I see a lot of single loop learning in SharePoint land, and I want to help people break out of their existing framing of the issue – compassionately, of course.

Enter the CALL Model

So getting back to my CALL model, I propose it as a multi-purpose tool that can be used for various SharePoint-related purposes. It is based on the Swiss cheese risk management model: a metaphor which suggests that most strategies have gaps that create risk. These gaps are analogous to holes in slices of Swiss cheese. In terms of the SharePoint governance poster, think of each of the areas it suggests should be covered as slices of cheese. The key to this model is that it assumes no single defence layer is sufficient to mitigate risk. It also implies that if risk mitigation strategies are set up with all the holes lined up, there is a systemic flaw, since a problem could progress all the way through and adversely affect the organisation. Accordingly, the Swiss cheese model encourages a more balanced view of how risk is managed.

You can think of the CALL model as a SharePoint optimised Swiss cheese model. CALL extends the Swiss cheese model by incorporating cutting edge research in enabling team performance (Hackman), collaboration (Wilder) and knowledge management (Duffield). It outlines 8 actionable areas (Swiss cheese slices) that operate at the individual, team and organisation levels. These focus areas can be thought of as enabling conditions that mitigate risk, as well as focus areas for identification and application of lessons learnt. In other words, my contention is that for SharePoint maturity, you should strive to create these 8 conditions and then consider them when evaluating project performance.

image

The image above is another drawing of the model, minus the pretty colours I used earlier. In this version, I show how the path of a SharePoint project flows through these 8 areas. Note how the arrow from left to right deviates, because we are using the areas to mitigate risk via defence in depth. But when it comes to applying learnings from a project (the arrow now moves from right to left to close the loop), the flow is designed to be smooth and unencumbered to ensure that the opportunity for double loop learning takes place.

Here is a description of each of the 8 focus areas:

Skills and expertise

This focus area is concerned with ensuring individuals are selected with the right skills and task expertise to perform their role in delivery and operation of services. In SharePoint this is critical because of the technical depth and breadth of the product. Want to deploy SharePoint 2013 request routing in dedicated mode? Go see Spence so he can tell you not to. Want to learn how the new WOPI protocol works with Office Web Apps? Sign a cheque for Wictor to help you.

Skills are closely associated with high IQ. In other words, specialist skills require smart, dedicated people. This area therefore also incorporates ensuring that staff have appropriate qualifications and certifications, that education, training and ongoing development practices are properly targeted, and that individuals are willing to learn new skills and are proactive in keeping themselves up-skilled. (In other words, all of the hallmarks of those brilliantly talented people who completed the now defunct Microsoft Certified Masters program.)

Collaborative Maturity

Ever heard of the term “dumb smart guy”? Usually it is someone who is intellectually smart, but has all the emotional maturity (EQ) of a potato. Collaborative maturity is all about ensuring that individuals have skills in working collaboratively with others. It signifies a willingness to listen, empathy, mutual respect, understanding and trust. Collaboratively mature people have a tolerance for ambiguity and the ability to engage in genuine dialogue to reach compromise. Collaboratively mature people also see collaboration as being in their self-interest and develop deep ties with colleagues in order to work interdependently.

Being in the IT industry, I’m not sure if this person actually exists, but hey – this description gives us all something to aspire to!

Role clarity

Role clarity is concerned with ensuring that the role of each team member is understood by everyone within the team and that it is clear how much authority is vested in each role. This in turn provides task clarity, fosters interdependency within a team and reduces process loss. (Process loss is the difficulty in knowing who is doing what and how it is done.) Where roles are clear and understood, team members are appropriately assigned to tasks according to their capacity (see “skills and expertise” above) and character (see “collaborative maturity” above).

Goal clarity

Goal clarity relates to purpose, direction and goal alignment between members of a team, which is essential for good team performance. A compelling purpose energizes team members, orients them toward their collective objective, fully engages their talents and motivates them to resolve conflicts. A compelling purpose should be underpinned by concrete, attainable goals and objectives, both short and long term. Knowing where you are heading focuses the team’s energy on directed, meaningful activity. This also helps build team efficacy, which is the belief within teams of their ability to solve problems and deliver great solutions. On the other hand, lack of goal clarity is one of the classic symptoms of wicked problems.

Participation safety and decision influence

Ever been on a project that’s taking on water, but nobody seems to be willing to listen? Ever had critically important topics go undiscussed because they are simply taboo and unmentionable? It is not fun – and little breakthrough thinking or innovation can exist without participation safety and decision influence. When a team has a high level of participation safety, members feel safe to share ideas, raise unpopular views or opinions, or speak their truth to one another. This reduces groupthink and social loafing, encourages breakthrough thinking and can lead to a collaborative team and a collaborative organisational culture. There are countless case studies of major disasters (such as the Deepwater Horizon oil spill) where a culture of “only tell me the good news” prevented critical information from being raised that could have averted the issue. In fact, ‘communication’ (or lack of it) is probably the most commonly cited project failure factor.

Having said all that, while participation safety is critical, the ideas that team members put forward also need to influence the direction and outcome of the team. A manager who says “my door is always open” but then ignores feedback creates dissonance among team members, because what is espoused is not practised. The simple fact of the matter is that a key element of peak performance is providing an environment safe enough for team members to speak their truths, rewarding them for doing so, and having that truth telling actually influence direction.

Enabling Technology

Technology underpins all aspects of organisational systems and projects and provides the means to generate leaps in the performance and capabilities of users and, more broadly, team and organisational productivity. Technology at its best facilitates the delivery of timely, relevant information for decision making, co-ordination and collaboration. Thus it is critical that technology does not get in the way of delivering value. How often have you worked on a project where you have been forced to use technologies that stifle productivity, create frustration and reduce collaboration between team members? How often has that technology been SharePoint?

Enabling Process

How often have you said to yourself, “I can’t believe I have to follow this braindead process”? Process is the glue that provides the rules of behaviour in delivering on goals and, like technology, underpins all aspects of organisational systems and projects and is a key part of performance and productivity. It is critical that process, like technology, is always driven by purpose and that it does not get in the way of delivering value. Inappropriate process can make a huge difference in how team members interact with stakeholders and each other.

Enabling Resources

Enabling resources is concerned with the financial, material and human input necessary to develop and sustain delivery of services. Put simply, even the best teams with the most compelling direction can falter if they are under-resourced. It is critical that sufficient funds, staff, materials and time are provided to get the job done.

Applications for the CALL model

The CALL model can be used in many ways, given its heritage of Hackman, Wilder and Duffield. Examples include:

  • A model for performing SharePoint governance health check/assessment
  • A model for assessing the makeup of a SharePoint team
  • A model for assessing the complexity of a proposed SharePoint solution
  • A model for assessing departmental readiness for SharePoint
  • A model for developing SharePoint Business Continuity planning

It is worth noting that Hackman developed a team performance instrument called the Team Diagnostic Survey, based around his six enabling conditions. Since the CALL model is so closely aligned with Hackman, it should lend itself to being used in a similar fashion. The same goes for the Wilder research on collaboration, which produced an instrument called the Collaboration Factors Inventory.
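Since we are talking about survey-style instruments, here is a minimal sketch (in Python, purely illustrative) of what a CALL-style self-assessment could look like. The eight focus areas come straight from the model above, but the 1–5 rating scale and the “weakest slice” reporting are my own assumptions for demonstration – this is not Hackman’s or Wilder’s instrument.

```python
# Illustrative sketch only: a CALL-style diagnostic loosely inspired by survey
# instruments such as Hackman's Team Diagnostic Survey. The rating scale and
# reporting logic are assumptions, not part of any published instrument.

CALL_AREAS = [
    "Skills and expertise",
    "Collaborative maturity",
    "Role clarity",
    "Goal clarity",
    "Participation safety and decision influence",
    "Enabling technology",
    "Enabling process",
    "Enabling resources",
]

def assess(ratings):
    """Print each area's rating (1 = condition absent, 5 = firmly in place)
    and flag the weakest slice of cheese as the next lessons learnt focus."""
    missing = [area for area in CALL_AREAS if area not in ratings]
    if missing:
        raise ValueError(f"No rating supplied for: {missing}")
    for area in CALL_AREAS:
        print(f"{area:<45} {'#' * ratings[area]}")
    weakest = min(CALL_AREAS, key=lambda area: ratings[area])
    print(f"\nWeakest enabling condition: {weakest} ({ratings[weakest]}/5)")

# Example usage with made-up ratings from a hypothetical project review
assess({
    "Skills and expertise": 4,
    "Collaborative maturity": 2,
    "Role clarity": 3,
    "Goal clarity": 2,
    "Participation safety and decision influence": 1,
    "Enabling technology": 4,
    "Enabling process": 3,
    "Enabling resources": 3,
})
```

The point is not the code, but the discipline it encodes: every slice gets rated, and the lowest-scoring condition becomes the focus area for the next iteration – which is exactly the conditions-to-lessons loop the model is named for.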

So given the source material, the CALL model also has applicability in the areas of:

  • A set of enabling conditions to establish in order to develop high performing teams
  • A set of enabling conditions for the successful collaborative delivery of projects
  • A focus area for the identification of risks (and opportunities) on organisational initiatives
  • A framework for the systematic capture of project lessons learnt
  • A framework for assessing change and other organisational initiatives
  • A governance maturity model
  • A knowledge management and organisational learning maturity model

Conclusion

The CALL model reflects a synthesis of three highly rigorous research efforts. All three seemed to gel really well when they were put into the melting pot and I was pleased with the result. In the next post, I will show you how I have used the CALL model when developing a SharePoint Business Continuity Strategy for a client. I’ll also talk about how I have used it in lessons learnt workshops.

Thanks for reading

 

Paul Culmsee

www.hereticsguidebooks.com



Aug 28 2013

Rethinking SharePoint Maturity Part 4: The number of the beast…


Hi all

This is part four of a series I am writing on my ideas about how the SharePoint community can revitalise how SharePoint maturity is understood and communicated. If this is your first time reading this series, then I urge you to go back and read the first three posts in the series because I covered a bit of ground to get here. For those of you who will not do that, here is a super quick recap.

As part of doing some book research, I came across three distinct bodies of work that I think can help Microsoft and their customers better understand, foster and cultivate SharePoint maturity. To set the scene, I started this series by taking a few cheap potshots at the SharePoint 2010 governance poster that many people use as a guide to ensure they are doing SharePoint “right”. I highlighted some of the less obvious risks with it by comparing it to a paint by numbers representation of the Mona Lisa. While people seem to understand implicitly that colouring in a Mona Lisa template will not make it a Mona Lisa (far from it), the same cannot be said for our SharePoint paint by numbers poster. The result is a skewed version of how SharePoint ‘maturity’ is perceived, because this poster ignores some critical factors.

[images: the SharePoint 2010 governance poster alongside the paint by numbers Mona Lisa]

To shine a spotlight on areas not covered by the above poster, I introduced you to the work of JR Hackman, whose pioneering work in the area of leadership and team development contains important lessons for SharePoint teams. Hackman urges people to stop trying to look at life through a simplistic cause and effect lens of “If you do Z, you will get Y” and instead focus on the conditions that enable or disable success. Hackman, in his examination of leadership and the performance of teams, listed six conditions that he felt led to better results if they were in place (none of which make much of an appearance on the above poster):

  1. A real team: Interdependence among members, clear boundaries distinguishing members from non-members and moderate stability of membership over time
  2. A compelling purpose: A purpose that is clear, challenging, and consequential. It energizes team members and fully engages their talents
  3. Right people: People who have task expertise, are self-organised and have skill in working collaboratively with others
  4. Clear norms of conduct: Team understands clearly what behaviours are, and are not, acceptable
  5. A supportive organisational context: The team has the resources it needs and the reward system provides recognition and positive consequences for excellent team performance
  6. Appropriate coaching: The right sort of coaching for the team was provided at the right time

I sometimes challenge agile development evangelists by saying “If Scrum was ‘the answer’, then there would be no need for Scrum coaches.” When you speak to good agile coaches, what they strive to do is create the sort of enabling conditions conducive to getting the best out of Agile – exactly what Hackman is urging us to do. Hackman’s insight is super important because it is incredibly common for people to say things like “now this will work for the right organisation…” or “this will work provided the right culture is there to support it,” without elaborating further on what “right” actually is. Hackman challenges us to focus on the enabling conditions that underpin a process, not the other way around.

To see if Hackman’s 6 conditions were applicable outside of his interest area of team development, I then examined the work done by the Wilder Research Group. This group published a book “Collaboration: What Makes It Work” which distilled the wisdom from 281 research studies on successful collaboration. Importantly for me, they looked at very different research than Hackman did, yet also broke it down to six quite similar focus areas:

  1. Membership characteristics: Skills, attributes and opinions of individuals as a collaborative group, as well as culture and capacity of orgs that form collaborative groups
  2. Purpose: The reasons for the collaborative effort, the result or vision being sought
  3. Process and structure: Management, decision making and operational systems of a collaborative context
  4. Communication: The channels used by partners to exchange information, keep each-other informed and convey opinions to influence
  5. Environment: Geo-location and social context where a collaborative group exists. While they can influence, they cannot control
  6. Resources: The financial and human input necessary to develop and sustain a collaborative group

Finally, I examined the work of Stephen Duffield, who took the Swiss Cheese model for risk management (which is popular in safety circles) and adapted it for organisational learning for projects. This is brilliant because the Swiss cheese model assumes that any one mitigation strategy has holes in it (hence the Swiss cheese metaphor). If you consider the SharePoint governance poster above as listing particular slices of cheese, then if you still have not met with the success you were hoping for, some important slices of cheese must be missing. So what are they?

To develop his project learning model (called SYLLK), Duffield distilled the wisdom of over 500 papers on the topics of project lessons learned, knowledge management and risk management. Like the above two research efforts, he also distilled it down to six key areas. Duffield’s slices of cheese were:

  1. Learning: Whether individuals on the team were skilled, had the right skills for their role and whether they were kept up-skilled
  2. Culture: What participants do, what role they fulfil, how an atmosphere of trust is developed in which people are encouraged, even rewarded, for truth telling – but in which they are also clear about where the line must be drawn between acceptable and unacceptable behaviour
  3. Social: How people relate to each-other, their interdependence and how they operate as a team
  4. Technology: Ensuring that technology and data supports outcomes and does not get in the way
  5. Process: Ensuring the appropriate protocols drive people’s behaviour and inform what they do (gate, checklists, etc.)
  6. Infrastructure: Environment (in terms of structure and facilities) that enable project outcomes

Why does all this stuff matter?

Now I am hoping this is not the case, but I would not be surprised if some readers have gotten to this point and are thinking “What the hell does all this have to do with SharePoint maturity?” (particularly if you have not read the three articles that preceded this one). So let’s address this one now…

Since 2009 I have taught a SharePoint Governance and Information Architecture class to BAs, PMs, CIOs and consultants, as well as developers and IT Pros. I’ve taught the class around the world, and in every class I start by asking students to articulate what they feel is the hardest thing about SharePoint delivery. Take a look at the answers for yourself…

A Brisbane 2012 class said:

  • Explaining what SharePoint is
  • User uptake (“People do not like new things”)
  • Managing proliferation of SharePoint sites
  • Too much IT ownership (“Sick of IT people telling me that SP is the solution”)
  • Users don’t know what they want
  • Difficulties around SP ownership because of a lack of accountability

Not too many technical answers here it seems – in fact I am seeing connections to Hackman, Wilder and Duffield already. Looking at these points, they indicate we are missing some key enabling conditions like a compelling purpose and role clarity, as well as the collaborative skills and attributes needed in the SharePoint team to address them (“User uptake” and “Difficulties around SP ownership because of a lack of accountability”). Perhaps we are also missing product skills (“proliferation of SharePoint sites”) and have an issue with process and structure, given the complaint of “too much IT ownership”.

Not convinced? For what it’s worth, Melbourne 2010 answered with:

  • Every project is “new” (“Traditional ASP.NET web site development is ‘same old same old’”)
  • The solution is never the same as the initial design and the end client may not realise this. The implication is gaps between expectation and delivery
  • Stakeholders don’t know what they want (“First time around what they sign off on is not what they want “)
  • Projects launched as “IT projects” with no clear deliverable and no success indicators
  • Lack of visibility as to what other organisations are doing
  • Determining limits and boundaries (“Doing anything ‘practically’ in SharePoint is hard”).
  • Managing expectations around SharePoint.
    • Clients with no experience think it can do everything
    • Difficulties getting information from and translating into design, so it can be implemented
  • Legacy of bad implementations makes it hard to win the business owner
  • Lack of governance
    • Viral spread of unmanaged sites
    • No proper requirements of “why”
    • No-one managing it

… and if you want to move further afield, Singapore 2012 said:

  • Trying to deal with the sheer number of features
  • “A totally different kind of concept”
    • A little knowledge can be dangerous
    • If you start with the wrong footing, you end up messing it up
  • Trying to deal with “I need SharePoint”
  • SharePoint for an external web site was difficult to use. Unfriendly structure for a public facing website
  • Trying to get users to use it (Steep learning curve for users)
  • The need for “deep discussion” to ensure SharePoint is put in for the right reasons. Without this, the result is messy, disorganised portals
  • The gap between the business and IT results in a sub optimal deployment
  • Demonstrating value to the business (SharePoint installed, but its potential is not being realized)
  • Stakeholders not appreciating the implication of product versus platform
  • You are working across the entire business (The disconnect between management/coalface)
  • “Everything hurts with SharePoint”
  • Facilitating the discussion at the business level is hard when your background is IT

Once again, if you start to distil the underlying themes behind the above answers, you can start to see how Hackman’s conditions are not met and how we are missing some of Duffield’s slices of cheese. So if you are not convinced of the relevance of Hackman, Wilder and Duffield’s research by now, then I can confidently tell you two things…

  1. I am not the SharePoint consultant that you need!
  2. I am eventually going to become the SharePoint consultant you need! (You will see the light eventually.)

The number of the beast…

Hopefully by now I have given you an appreciation of Hackman, Wilder and Duffield’s work and its relevance to SharePoint maturity. All of them have made highly rigorous research efforts, each asking different questions and utilising different research. Those of you who are more superstitiously minded might feel that I am meddling with the forces of darkness here, given that each of the three distilled 6 themes – 666!

While there is no fire and brimstone nearby as I write this text, my next job was indeed to meddle with dark forces, in the sense that I had to perform my own analysis of their work to draw out the commonalities and gaps. I figured that this would go a long way toward laying the groundwork for addressing the gaps I see in SharePoint maturity. I won’t bore you with too much of the detail of this effort, except to say that I used Dialogue Mapping techniques to perform the synthesis. Below are a couple of screenshots illustrating the maps I built, which give you a feel for what was done (click to enlarge)…

 

In a nutshell, I mapped out each author’s six constructs and then mapped their behind-the-scenes detail. As expected, while some terminology differed somewhat, by and large there was a lot of commonality. I linked all the commonalities together in my maps and slowly, a new picture began to emerge…

Meddling with forces…

When you think about it, my synthesis of this research is not all that different from what SharePoint people do all the time when developing an information architecture to store organisational data in a meaningful and intuitive way. Just like when you have to determine an appropriate navigation structure for your SharePoint sites, with this work I had to first think about the basis for how I wanted to structure things.

First up, I found Duffield’s use of the Swiss cheese model for risk inspiring and absolutely applicable to SharePoint maturity. The metaphor of holes in each slice of SharePoint governance “cheese” is vivid and confronting. The commonality in the answers to the “What is the hardest thing about SharePoint?” question above speaks to the fact that there are universally common missing slices of SharePoint governance cheese. I also felt strongly that there was huge potential for a clearer relationship between Hackman’s enabling conditions and Duffield’s lessons learnt. My logic was pretty basic here: surely organisational lessons learnt (if actually learnt) should become the enabling conditions next time around? It seemed to me that Hackman’s “conditions” and Duffield’s “lessons” are looking at the same thing, just from different points in time. So my intention was to develop my own Swiss cheese model using constructs (navigation labels if you will) that could serve as lessons learnt focus areas as well as enabling conditions. That way one could truly gauge whether lessons really had been learnt, because by definition they would be the enabling conditions to strive for next time around.

The other thing I was looking to do was to make things more actionable. Saying things like “have the right people”, “the right membership characteristics” or “the right culture” are not easily actionable because what exactly is “right” anyway? To that end, I think all of the authors suffered from creating constructs that were fine for their purposes, but were a bit wishy-washy when trying to be more directive and action-focused.

One particular anomaly I noticed when comparing the research was the lack of “purpose” as a construct in Duffield’s work. Hackman and Wilder both list “purpose” as one of their six factors, but Duffield’s SYLLK model makes no mention of purpose at all. I felt this needed to be addressed because lack of purpose is one of the classic symptoms of a wicked problem – a topic I have been writing about for years now. So for me, clarity of purpose is a very big slice of Swiss cheese!

I also felt that Hackman was a bit weak on some of the process/system areas. Duffield lists “process”, “technology” and “infrastructure” as key focus areas for lessons learnt on projects and the closest Hackman comes to that is “Supportive organizational context”. This is understandable when you remember that Hackman was talking only about teamwork in general. Nevertheless for our SharePoint context, I thought that Duffield was closer to the mark. Wilder turned out to be a bit in between – as they list “Process and structure” and “Resources” as two of their six areas.

There were other various things that influenced my synthesis as well, but none of them matter for this discussion. I guess you want to see the results no?

The CALL model

The result of this work is a model I have called the CALL model (Conditions to Actionable Lessons Learnt). Since this post is getting long, I will defer a detailed description of the model to the next article. But to whet your appetite, below is an image that shows what I ended up with. It adopts Duffield’s SYLLK model as its base, but highlights 8 enabling conditions (or learning areas), split across individual, team and organisation lines. A distinction is also made between the soft (people) factors and the harder (system) factors. The arrows are there to help convey the “defence in depth” idea of the Swiss cheese model.

If you go back to the start of this article and examine Hackman, Wilder and Duffield’s work, you will see all of them represented here, except that I used more action-focused labels. My contention is that these eight areas are what you should be considering when judging your organisation’s SharePoint maturity. The importance of considering these as enabling conditions, to be put in place right at the start, cannot be overstated. Going back to Hackman, here is what he said about the importance of enabling conditions for team performance (emphasis mine):

Let me go out on a limb and make rough estimates of the size of these effects. I propose that 60 per cent of the difference in how well a group eventually does is determined by the quality of the condition-setting prework the leader does. Thirty per cent is determined by how the initial launch of the group goes. And only 10 per cent is determined by what the leader does after the group is already underway with its work. This view stands in stark contrast to popular images of group leadership—the conductor waving a baton throughout a musical performance or an athletic coach shouting instructions from the sidelines during a game

I note that Hackman’s view doesn’t just stand in stark contrast to “popular images of group leadership” – it also stands in stark contrast to Microsoft’s SharePoint governance poster too!

Conclusion

Now I am sure that some of you are still trying to decipher the model above, or feel the urge to tell me that something is missing or that the labels for each of my slices of SharePoint maturity cheese don’t do it for you. In the next post I will spend more time going over the model and what each slice really means. Given the heritage of the source material that inspired it, I feel that it can be used in various SharePoint and non-SharePoint scenarios. So I will also spend some time talking about that.

Finally, I don’t know if any of you have noticed, but I didn’t actually develop this model just for SharePoint! In fact I have already used this model in various non-SharePoint ways too. We will also take a look at this…

thanks for reading

 

 

Paul Culmsee

www.hereticsguidebooks.com



Aug 26 2013

Rethinking SharePoint Maturity Part 3: Who moved my cheese?


Hi all

Welcome to part 3 in this series about rethinking what SharePoint “maturity” looks like. In the first post, I introduced the work of JR Hackman and his notion of trying to create enabling conditions, rather than attribute cause and effect. Hackman, in his examination of leadership and the performance of teams, listed six conditions that he felt led to better results if they were in place. Those conditions were:

  1. A real team: Interdependence among members, clear boundaries distinguishing members from non-members and moderate stability of membership over time
  2. A compelling purpose: A purpose that is clear, challenging, and consequential. It energizes team members and fully engages their talents
  3. Right people: People who have task expertise, are self-organised and have skill in working collaboratively with others
  4. Clear norms of conduct: Team understands clearly what behaviours are, and are not, acceptable
  5. A supportive organisational context: The team has the resources it needs and the reward system provides recognition and positive consequences for excellent team performance
  6. Appropriate coaching: The right sort of coaching for the team was provided at the right time

I then got interested in how applicable these conditions were to SharePoint projects. The first question I asked myself was, “I wonder if Hackman’s conditions apply to collaboration itself, as opposed to teams.” To find out, I utilised some really interesting work done by the Wilder Research Group, which produced a book called “Collaboration: What Makes It Work.” This book distilled the wisdom from 281 research studies of collaborative case studies and their success or failure. They distilled things down to six focus areas (ending up with the same number as Hackman). Their six were:

  1. Membership characteristics: (Skills, attributes and opinions of individuals as a collaborative group, as well as culture and capacity of orgs that form collaborative groups)
  2. Purpose: (The reasons for the collaborative effort, the result or vision being sought)
  3. Process and structure: (Management, decision making and operational systems of a collaborative context)
  4. Communication: (The channels used by partners to exchange information, keep each-other informed and convey opinions to influence)
  5. Environment: (Geo-location and social context where a collaborative group exists. While they can influence, they cannot control)
  6. Resources: (The financial and human input necessary to develop and sustain a collaborative group)

If you want the fuller detail of Hackman and Wilder, check the first and second posts respectively. But it should be clear from even a cursory look at the above lists that there is a lot of overlap and many common themes between these two research efforts, and that we can learn from them in our SharePoint work. I strongly believe that this sort of material addresses a critical gap in a lot of the guidance out there on what it takes to have a successful SharePoint deployment, and offers some excellent ideas for further developing ideas around SharePoint maturity. I started to develop a fairly comprehensive Dialogue Map of both of these research efforts so I could synthesise them to create my own set of “conditions” in the way Hackman describes. While I was doing this, I met a fellow via LinkedIn who opened my mind to further possibilities. Everybody, meet Stephen Duffield.

Duffield’s SYLLK model for lessons learnt

I met Steve because we both shared a common interest in organisational knowledge management. In fact, Steve is working on his PhD in this area, focussing on the pitiful record organisations have of utilising lessons learnt practices on projects and then embedding them into organisational culture and practices. If you have ever filled out a lessons learnt form, knowing full well that it will disappear into a filing cabinet never to be seen again, Steve shares your frustration. For his PhD, he is tackling two research questions:

  1. What are the significant factors that negatively influence the capture, dissemination and application of lessons learned from completed projects within project-based organisations?
  2. Can a systemic knowledge model positively influence the capture, dissemination and application of project management lessons learned between project teams within the organisation?

Now if you think it was impressive that Wilder researched 281 studies on collaboration, Steve topped them by miles. His PhD literature review covered more than 500 papers on the topics of project lessons learned, knowledge management, risk management and the like. 500! Man, that’s crazy – all I can say to that is I am sure as hell glad he did it and I didn’t have to!

So what was the result of Duffield’s work? In a nutshell, he has developed a model called “Systemic Lessons Learned Knowledge” (SYLLK), which was influenced by the Swiss Cheese model for risk management, originally proposed by Dante Orlandella and James T. Reason.

Why SYLLK is important for SharePoint

Before I explain Duffield’s SYLLK model, it is important that I briefly explain the Swiss Cheese model for risk management that inspired him. The Swiss Cheese Model (see the image to the left) for risk management is commonly used in aviation and healthcare safety. It is based on the notion that systems have potential for failure in many areas, and these are analogous to a stack of slices of Swiss cheese, where the holes in each slice are opportunities for a process to fail. Each of the slices is a “defensive layer”, and while an error may pass through a hole in one layer, in the next layer the holes are in different places, allowing the problem to be caught before its impact becomes severe.

The key to the Swiss Cheese Model is that it assumes that no single defence layer is sufficient to mitigate risk. It also implies that if risk mitigation strategies exist, yet all of the holes are lined up, the system is inherently flawed. Why? Because it would allow a problem to progress through all controls and adversely affect the organisation. Therefore, its use encourages a more balanced view of how risks are identified and managed.
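To make the arithmetic behind “defence in depth” concrete, here is a small, purely illustrative calculation. The 20% “hole” probability per slice is an arbitrary assumption; the point is how quickly residual risk shrinks when layers fail independently, and how extra layers buy you nothing when the holes line up.

```python
# Illustrative sketch of the Swiss cheese idea. Each defensive layer has some
# probability of a "hole" (failing to catch a problem); the 20% figure is an
# arbitrary assumption chosen to make the numbers easy to follow.

HOLE_PROBABILITY = 0.2  # each slice misses 20% of problems

def residual_risk(num_layers, holes_aligned):
    """Probability that a problem passes through every defensive layer."""
    if holes_aligned:
        # Aligned holes: the layers share the same weakness, so adding more
        # slices changes nothing - one gap defeats them all.
        return HOLE_PROBABILITY
    # Independent layers: the problem must find a hole in every slice.
    return HOLE_PROBABILITY ** num_layers

for layers in (1, 3, 6):
    print(f"{layers} independent layer(s): {residual_risk(layers, False):.4%} of problems get through")
print(f"6 layers with aligned holes: {residual_risk(6, True):.4%} of problems get through")
```

Six imperfect but independent slices outperform a single slice by orders of magnitude, yet six slices with aligned holes are no better than one – which is precisely the inference drawn below about repeatedly stressed SharePoint projects.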

So think about that for a second… SharePoint projects to this day remain difficult to get right. If you are on your third attempt at SharePoint, then by definition you’ve had previous failed SharePoint projects. The inference when applying the Swiss cheese model is that your delivery approach is inherently flawed and you have not sufficiently learnt from it. In other words, you were – and maybe still are – missing some important slices of cheese from your arsenal. From a SharePoint maturity perspective, we need to know what those missing slices are if we wish to raise the bar.

So the challenge I have for you is this: If you have had a failed or semi-failed SharePoint project or two under your belt, did you or others on your team ever say to yourself “We’ll get it right this time” and then find that the results never met expectations? If you did, then Duffield’s (and my) contention is you might have failed to truly understand the factors that caused the failure.

Back to Duffield…

This is where Duffield’s work gets super interesting. He realised that the original Swiss cheese “slices”, which revolved around safety, were inappropriate for a typical organisation managing its projects. Like the Wilder work on collaboration, Steve reviewed tons of literature and synthesised from it what he thinks are the key slices of cheese required not only to enable the mitigation of project risks, but also to focus people on the critical areas that need to be examined to capture the full gamut of lessons learnt on projects.

So how many slices of cheese do you think Steve came up with? If you read the previous two posts then you can already guess at the answer. Six!

There really seems to be something special about the number 6! We have Hackman coming up with 6 conditions for high performing teams, Wilder’s 6 factors that make a difference in successful collaboration and Duffield’s 6 areas that are critical to organisational learning from projects! For the record, here are Duffield’s six areas (the first three are labelled as people factors and the second three are system factors):

  1. Learning: Whether individuals on the team are skilled, have the right skills for their role and whether they are kept up-skilled
  2. Culture: What participants do, what role they fulfil, how an atmosphere of trust is developed in which people are encouraged, even rewarded, for truth telling – but in which they are also clear about where the line must be drawn between acceptable and unacceptable behaviour
  3. Social: How people relate to each-other, their interdependence and how they operate as a team
  4. Technology: Ensuring that technology and data supports outcomes and does not get in the way
  5. Process: Ensuring the appropriate protocols drive people’s behaviour and inform what they do (gate, checklists, etc.)
  6. Infrastructure: Environment (in terms of structure and facilities) that enable project outcomes

Duffield has a diagram that illustrates the SYLLK model, showing how his six identified organisational elements of learning, culture, social, technology, process and infrastructure align as Swiss cheese slices. I have pasted it (with permission), below (click to enlarge).

Duffield states that the SYLLK model represents “the various organisational systems that collectively form the overall behaviour of the organisation. The various modes of social and cultural learning, along with the organisational processes, infrastructure and technology that support them.” Notice in the above diagram how the holes in each slice are not lined up when the project arrow moves right to left. This makes sense because the whole point of the model is the idea of “defence in depth.” But the holes are aligned when moving from left to right. This is because each slice of cheese needs to be aligned to enable the feedback loop – the effective dissemination and application of the identified lessons.

Conclusion

The notion of the Swiss cheese model for mitigating risk makes a heck of a lot of sense for SharePoint projects, given that

  • a) there are a myriad of technical and non-technical factors that have to be aligned for sustained SharePoint success, and
  • b) SharePoint success remains persistently elusive for many organisations.

What Duffield has done with the SYLLK model is to take the Swiss Cheese model out of the cloistered confines of safety management and into organisational learning through projects. This is huge in my opinion, and creates a platform for lots of innovative approaches around the capture and use of organisational learning, all the while framing it around the key project management task of identifying and mitigating risk. From a SharePoint maturity perspective, it gives us a very powerful approach to see various aspects of SharePoint project delivery in a whole new light, giving focus to aspects that are often not given due consideration.

Like the Wilder model, I love the fact that Duffield has done such a systematic and rigorous review of literature and I also love the fact that his area of research is quite distinct from Hackman (conditions that enable team efficacy) and the Wilder team (factors influencing successful collaboration). When you think about it, each of the three research efforts focuses on distinct areas of the life-cycle of a project. Hackman looks at the enabling conditions required before you commence a project and what needs to be maintained. Wilder appears to focus more on what is happening during a project, by examining what successful collaboration looks like. Duffield then looks at the result of a project in terms of the lessons learnt and how this can shape future projects (which brings us back to Hackman’s enabling conditions).

While all that is interesting and valuable, the honest truth is that I liked the fact that all three of these efforts ended up with six “things”. It seemed preordained for me to “munge” them together to see what they collectively tell us about SharePoint maturity.

… and that’s precisely what I did. In the next post we will examine the results.

 

Thanks for reading

 

Paul Culmsee

www.hereticsguidebooks.com



Mar 13 2013

Making Sense of SharePoint and Digital Records Management…


Hi all

One of the conversation areas in SharePoint life that is inevitably complex is records management, since there are just as many differing opinions on records management as there are legal jurisdictions and standards to choose from. Accordingly, a lot of confusion abounds as we move into a world dominated by cloud computing, inter-agency collaboration, changes in attitudes to information assets via the open data/government 2.0 movements and, of course, the increasing usage of enterprise collaboration systems like SharePoint. As a result, I feel for records managers because generally they are an unloved lot and it is not really their fault. They have to meet legal compliance requirements governed by various acts of legislation, but their job is made all the harder by the paradox that the more one tries to enforce compliance, the less likely one is to be compliant. This is because more compliance generally equates to more effort on the part of users for little perceived benefit. The result is direct avoidance of record management systems or plain misuse of those systems (both of which in turn result in a lack of compliance).

As it happens, my company works with many government agencies primarily in the state of Western Australia, both at a state agency and local government level. We have seen most enterprise document management systems out there such as HP Trim, Objective, Hummingbird/OpenText and have to field questions on how SharePoint should integrate and interact with them (little known fact – I started my career with Hummingbird in 1998 when it was called PCDOCS Open and before SharePoint existed).

Now while I am sympathetic to the plight of your average records management professional, I have also seen the other side of the coin, where records management is used to create fear, uncertainty and doubt. “You can’t do that, because of the records act” is a refrain that is oft-levelled at initiatives like SharePoint or cloud based solutions to try and shut them down or curtail their scope. What makes it hard to argue against such statements is that few ever read such acts (including those who make these sort of statements). So being a sucker for punishment, I decided to read the Western Australian State Records Act 2000 and the associated standard on digital recordkeeping, published by the State Records Office. My goal was to understand the intent of these standards and the minimum compliance requirements they mandate, so I could better help clients integrate potentially disruptive tools into their compliance strategies.

I did this by starting out with the core standard in Western Australia – SRC Standard 8: Digital Recordkeeping. I created an IBIS Issue Map of this standard using Compendium software. What I soon discovered was that Standard 8 refers to other standards, such as Standard 2: Recordkeeping Plans and Standard 3: Appraisal of Records. That meant that I had to add these to the map, as well as any other documents they referred to. In the end, I followed every standard, policy or guideline in a recursive fashion, until I was back at the digital recordkeeping standard where I started. This took a while, but I eventually got there. You can click the image below to examine the standards in all of their detail and watch the video to see more about how I created it.

Map   

Now I need to make it clear that my map is not endorsed by the State Records Office, so it is provided as-is with a disclaimer that it is not intended to drive policy or be used as anything more than an example of the mapping approaches I use. By putting the standards into an IBIS-based issue map, I feel I was able to reduce some of the complexity of understanding them, because one can now visually see how the standards relate to each other. Additionally, by taking advantage of Compendium’s ability to have the same node in multiple maps, I was able to create a single ‘meta map’ that pulled all of the compliance requirements into one integrated place. One can look at the compliance requirements of all the standards in one place and ask, “Am I meeting the intent of these standards?”
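For those curious about the mechanics, the recursive chasing of references described above is essentially a depth-first walk of a document graph that stops when the trail loops back on itself. The sketch below (Python, purely illustrative) captures that idea – only the Standard 8 to Standards 2 and 3 links are taken from the text; the remaining documents and links are hypothetical stand-ins, not the real reference structure.

```python
# Illustrative sketch: following document cross-references recursively until
# they loop back to the starting point. Only the Standard 8 -> Standards 2
# and 3 references come from the article; the rest of this graph is a
# hypothetical stand-in for the real documents.

REFERENCES = {
    "SRC Standard 8: Digital Recordkeeping": [
        "SRC Standard 2: Recordkeeping Plans",
        "SRC Standard 3: Appraisal of Records",
    ],
    "SRC Standard 2: Recordkeeping Plans": ["Hypothetical Guideline A"],
    "SRC Standard 3: Appraisal of Records": ["Hypothetical Policy B"],
    "Hypothetical Guideline A": ["SRC Standard 8: Digital Recordkeeping"],
    "Hypothetical Policy B": ["SRC Standard 8: Digital Recordkeeping"],
}

def walk(document, visited=None, depth=0):
    """Depth-first walk of the reference graph. A document that has already
    been visited is printed once more (to show the loop) and then skipped."""
    if visited is None:
        visited = set()
    print("  " * depth + document)
    if document in visited:
        return  # we have come full circle - nothing new to map
    visited.add(document)
    for referenced in REFERENCES.get(document, []):
        walk(referenced, visited, depth + 1)

walk("SRC Standard 8: Digital Recordkeeping")
```

In Compendium this was a manual exercise of adding nodes and links, but the shape of the work is the same: keep following references until every trail ends where you started.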

Reflections…

In terms of my conclusions from undertaking this work, there are a few. For a start, everything is a record, so people should just get over the whole debate of “is it or isn’t it”. In short, if you work for a government agency and are doing actual work, then your work outputs are records. The issue is not what is and is not a record, but how you control and manage them. Secondly, the notion that there has to be “one RMS system to rule them all” to ensure compliance is plain rubbish and does not stand up to any form of serious scrutiny. While it is highly desirable to have a single management point for digital recordkeeping, it is often not practical, and insistence on doing this often makes agencies less compliant because of the aforementioned difficulties of use, resulting in passive resistance and outright subversion of such systems. It additionally causes all sorts of unnecessary stress in the areas of new initiatives or inter-agency collaboration efforts. In fact, to meet the intent of the standards I mapped, one, by definition, has to take a portfolio approach to the management of records, as data will reside in multiple repositories. It was Andrew Jolly who first suggested the portfolio idea to me and provided this excellent example: there is nothing stopping records management departments designating MS Exchange 2013 Site mailboxes as part of the records management portfolio, while at the same time having a much better integrated email and document management story for users.

For me, the real crux of the digital records management challenge is hidden away in SRC Standard 8, Principle 5 (preservation). One of the statements of compliance in relation to preservation is that “digital records and their metadata remain accessible and usable for as long as they are required in accordance with an approved disposal authority.” In my opinion, the key challenge for agencies and consultancies alike is being able to meet the requirements of Disposal Authorities (DAs) without overburdening users. DAs are the legal documents published by the State Records Commission that specify how data is handled in terms of whether it is archived or deleted and when this should happen. They are also quite prescriptive (some are mandated), and their classification of content from a retention and disposal point of view poses many challenges, both technically and organisationally. For the sake of brevity, this article is not going to get into this topic in detail, but I would advise any SharePoint practitioner to understand the relevant disposal authorities that their organisation has to adhere to. You will come away with a new respect for the challenges that records managers face, an understanding of why they use the classification schemes that they do, why records management systems are not popular among users, and why the paradox of “chasing compliance only to become non-compliant” happens.
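To give a feel for what a Disposal Authority boils down to in system terms, here is a minimal, hypothetical sketch. The record classes, retention periods and disposal actions are invented for illustration only – real DAs are far more granular and must be read from the State Records Commission documents themselves.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical sketch of disposal authority rules. The record classes, periods
# and actions below are invented for illustration and do not reflect any real
# Disposal Authority published by the State Records Commission.

@dataclass
class DisposalRule:
    record_class: str
    retain_years: int
    action: str  # e.g. "destroy" or "transfer to State archives"

RULES = {
    "routine correspondence": DisposalRule("routine correspondence", 2, "destroy"),
    "contracts": DisposalRule("contracts", 7, "destroy"),
    "council minutes": DisposalRule("council minutes", 0, "transfer to State archives"),
}

def disposal_status(record_class, closed, today):
    """Describe what should happen to a record of the given class, closed on
    the given date, according to the (hypothetical) rules above."""
    rule = RULES.get(record_class)
    if rule is None:
        return f"No rule found for '{record_class}' - refer to the records manager."
    due = date(closed.year + rule.retain_years, closed.month, closed.day)
    if today >= due:
        return f"{record_class}: {rule.action} (retention period expired {due})"
    return f"{record_class}: retain until {due}, then {rule.action}"

print(disposal_status("contracts", closed=date(2010, 6, 30), today=date(2013, 3, 13)))
print(disposal_status("council minutes", closed=date(2012, 1, 1), today=date(2013, 3, 13)))
```

The hard part in practice is not the rule evaluation but the classification – getting every item of content reliably assigned to a record class without overburdening the people creating it, which is exactly the paradox described above.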

Maybe you might come away with some insights on how to better integrate SharePoint into the story? Then you can tell the rest of us.

Thanks for reading

Paul Culmsee

[email protected]



Jan 22 2013

Confession of a (post) SharePoint architect… What are you polishing?


Hi and welcome to the latest exciting instalment in my epic series of posts on my confessions of a post SharePoint architect. I was motivated to write this series because the mild mannered shrinking violet known as Bjorn Furuknap wrote an insightful series of articles on what it takes to be a SharePoint professional. I had always planned to write on this topic as well and opted to frame it as SharePoint “confessions” because some of my approaches do not always seem mainstream (but work!) so it feels like I am confessing my sins for using them. I chose to use the word “(post)” because SharePoint is not my fulltime gig anymore. I am very lucky to do a lot of non IT work, helping organisations deal with highly complex problems. This side of my work is where most of my insights have come from and what inspired the Heretics Guide to Best Practices book.

Thus far, we have traversed a fair bit of territory via the use of f-laws – home truths about successful SharePoint delivery that focus on areas often overlooked for various reasons. In case this is your first visit to this series, we have covered 6 f-laws so far and I strongly suggest that you read them first…

In this post, we are continuing with f-law 6, focusing on aspects of SharePoint delivery where geeks have a habit of being crap…

No matter how much you polish it…

In Australia, there is an old saying that no matter how much you try to polish a turd, it will always be a turd. In the last post, I more or less stated that some geeks have a tendency to polish turds. They do this because of a combination of an inflated view of their own importance, mental scars from the ghosts of disaster recoveries past, and a bias toward something I termed dial tone governance.

Dial tone governance refers to all of the stuff that ensures the SharePoint platform remains pristine, consistently reliable and high performing. I noted in the previous post that this echoes what quality assurance aspires to do. This type of IT assurance for SharePoint is completely necessary, but it is definitely not sufficient. If it were, lavish praise would be heaped upon us heroic geeks for consistently fantastic SharePoint delivery.

In the last post I also channelled Neo from the Matrix and suggested that being a hero like Neo is a thankless job since, for many stakeholders, the assurance of dial tone is assumed to be there. Whether this is right or wrong is not the point, because geeks do not survive their own hypocrisy on this matter. After all, no one thanks the telephone company for providing them with dial tone when they pick up the phone to make a call – they just get pissed when it is not there.

Now while I can sympathise with unloved telephone company engineers, they actually have it easy because once they provide dial tone, their job is done. This also applies to tools like Microsoft Exchange, Virtualisation, IP networks and storage. Unfortunately with SharePoint, successful delivery is not judged on whether the level of dial tone is appropriate. At the end of the day no amount of turd polishing via awesome support, consistent process or fast response time will make a crap solution anything other than a crap solution.

So this raises a couple of questions that readers should consider:

  1. Am I focusing too much (or too little) on dial tone governance?
  2. What are the other governance areas that I need to focus on?

As it happens, I have some data we can use to answer them.

The hardest thing…

In 2009 I created my SharePoint Governance and Information Architecture class. The class is attended by a wide variety of roles, from BAs to PMs and CIOs, as well as developers and tech dudes. It has been delivered around the world with students representing just about every industry sector (including Microsoft employees). This combination of varied audience, varied industry sectors and geographic location has provided a lot of insight, because at the start of every class I ask students to introduce themselves and tell the class what they feel is the hardest thing about SharePoint delivery, and I dialogue map the answers.

Can you see the logic of this question? By listing all of the areas that are hard about SharePoint delivery, what should emerge are the areas we should be focusing on. Why? Well, the hard bits are likely to be the areas of most risk when it comes to a failed or stressed deployment.

So let’s go through the answers given to me from a few SPGovIA classes. Maybe there are some consistent patterns that emerge. It will also be interesting to see how much of it is dial tone governance.

Brisbane 2012 and Melbourne 2010

First up, here are the answers I captured from a small class in Brisbane in 2012:

  • Explaining what SharePoint is
  • User uptake (“People do not like new things”)
  • Managing proliferation of SharePoint sites
  • Too much IT ownership (“Sick of IT people telling me that SP is the solution”)
  • Users don’t know what they want
  • Difficulties around SP ownership because of a lack of accountability

For me some interesting things emerge already, but before we get into detail, let’s look at a Melbourne class answering this question two years earlier and see if any consistent patterns emerge.

  • Every project is “new” (“Traditional ASP.NET web site development is ‘same old, same old’”)
  • In SharePoint you can do things in many ways so the initial design takes longer
  • The solution is never the same as the initial design and the end client may not realise this. The implication is gaps between expectation and delivery
  • Stakeholders don’t know what they want (“First time around what they sign off on is not what they want “)
  • Projects launched as “IT projects” with no clear deliverable and no success indicators
  • Lack of visibility as to what other organisations are doing
  • Determining limits and boundaries (“Doing anything ‘practically’ in SharePoint is hard”).
    • For example: “We improved UX in certain areas, but to extend to the entire feature set would take too long”
  • Managing expectations around SharePoint.
    • Clients with no experience think it can do everything
    • Difficulties getting information from and translating into design, so it can be implemented
  • Legacy of bad implementations makes it hard to win the business owner
  • Lack of governance
    • Viral spread of unmanaged sites
    • No proper requirements of “why”
    • No-one managing it

Some analysis…

The first thing I notice is that if you go back to the start of this post and review the six f-laws, four are clearly represented here. We have stakeholders not knowing what they want, which makes design hard (f-law 2); the gap between expectation and delivery (f-law 5); and the problem of SharePoint projects being done as “IT projects” (f-law 6) with “no clear deliverables and no success indicators” (f-law 3). Other themes include lack of accountability and managing viral growth of sites, but the overwhelming theme that comes through for me is that of managing user expectations and buy-in.

A telling aspect of what is listed is that, aside from the ever-present issue of managing site sprawl, not much of it is dial tone at this point. To see if this pattern continues, let’s head to Auckland, New Zealand and see if the Kiwis are any more geeky than their Australian cousins…

Auckland 2011

  • Gap between expectations and reality
  • Accountabilities and role clarity around delivery
  • Knowledge transfer and ongoing maintenance (“Not everything is written down and when people leave, key critical information is lost. For example: Business rules set up at the start are lost over time”)
  • Helping people change practices (“Getting people to use it “)
  • Managing the growth over time (“the challenge of a large user base wanting to use it in different ways”)
  • It’s a big, complex product
  • The perception of “mystique” around SharePoint (“hard to know what not to do”)
  • Seen as another “IT service”
    • product selected because it’s Microsoft
    • the people who chose the product/delivering the product are IT
  • Translating the capabilities of the product to the needs of the user
    • Getting the business to understand SharePoint’s capability
    • Restrictions vs freedom
  • Ramp up time: The learning curve across all roles (tech and non tech)
  • The challenge of user requirements: Knowing the right questions to ask

Some more analysis…

It is clear that the themes that emerged from the two classes in Australia are also consistent here. The issue of stakeholder expectations came up straight away, as did the “IT driven project” issue (“seen as another ‘IT service’”). Once again, the only real dial tone governance issue was the problem of managing site growth over time, but even then it was framed more as an expectations issue (“the challenge of a large user base wanting to use it in different ways”). F-law 4 also copped a mention in terms of knowing the right questions to ask to get the right user requirements.

The additional themes that I noted from this group were:

  • complexity (“It’s a big, complex product“)
  • change management (“helping people change practices”)
  • the high learning curve of SharePoint for users
  • knowledge transfer over time

<geek alert>Now if you are reading this and you manage complex infrastructure, let me assure you that tech people were in the classes</geek alert>. Also, since Australia and New Zealand are culturally quite similar to each other, it could be argued that we are not taking into account different cultures. So let’s find out what a 2012 class in Singapore had to say…

Singapore 2012

  • Trying to deal with the sheer number of features
  • “A totally different kind of concept”
    • A little knowledge can be dangerous
    • If you start with the wrong footing, you end up messing it up
  • Trying to deal with “I need SharePoint”
  • SharePoint for an external web site was difficult to use. Unfriendly structure for a public facing website
  • Trying to get users to use it (Steep learning curve for users)
  • The need for “deep discussion” to ensure SharePoint is put in for the right reasons. Without this, the result is messy, disorganised portals
  • The gap between the business and IT results in a sub optimal deployment
  • Demonstrating value to the business (SharePoint installed, but its potential is not being realized)
  • Stakeholders not appreciating the implication of product versus platform
  • You are working across the entire business (The disconnect between management/coalface)
  • “Everything hurts with SharePoint”
  • Facilitating the discussion at the business level is hard when your background is IT

Final Analysis

Once again, the answers provided by the Singapore attendees are extraordinarily consistent with the other three classes we looked at. User expectations and adoption were at the forefront, complexity was there, as was the business/IT disconnect, as well as demonstrating business value. The themes of platitudes (f-law 4) and confusing means with ends (f-law 1) were apparent in the comment about dealing with the “I need SharePoint” issue.

I also note that the Singapore group seemed to have a greater recognition of their weaknesses – particularly the “totally different kind of concept” quote and the last comment about the difficulty of facilitating discussion “when your background is IT”. I also noted one potential dial tone comment about the difficulty of deploying SharePoint as a public facing website. Another emergent complexity-related theme here is the perennial problem of SharePoint’s ample supply of features (and caveats), which risks inappropriate up-front design decisions that have negative consequences later (“Trying to deal with the sheer number of features,” “A little knowledge can be dangerous” and “If you start with the wrong footing, you end up messing it up”).

Finally, I particularly liked the comment about the need for “deep discussion” to ensure SharePoint is put in for the right reasons – that one was made by one of the Microsofties who attended the class.

Conclusions and takeaways

My own conclusion from this examination is that the responses from class attendees illustrate that dial tone governance (which is best termed as IT assurance) is necessary, but certainly not sufficient. The focus on IT assurance is a reflection of the lens that IT looks through. After all, when your performance is judged on keeping things running smoothly and reliably, it makes sense that you will focus on this.

But as illustrated by the responses, it seems that IT assurance isn’t all that hard. If it were, then why didn’t dial tone topics come up more frequently in the responses?

So IT people, always remember f-law 1. The word ‘govern’ means to steer: we aim to steer the energy and resources available for the greatest benefit to all. Assurance, on the other hand, provides confidence in a product’s suitability for its intended purpose. It is a set of activities intended to ensure that customer requirements are satisfied in a systematic, reliable fashion. (I didn’t make that up by the way – that is how the ISO9000 family of standards for quality management describes assurance.)

The key takeaway is that to be effective and successful you actually need to apply both governance and assurance. You cannot have one without the other. Whether you have the balance right between dial tone and all the other stuff is for you to decide. So rather than focus on the stuff you already know well, perhaps it is worth asking yourself what you find hard and focus there as well.

 

Thanks for reading

 

Paul Culmsee

www.hereticsguidebooks.com



Dec 30 2012

Confessions of a (post) SharePoint architect: The dangers of dial tone governance…


Hi all and welcome to the next exciting instalment of my confessions from my work as a SharePoint architect and beyond. This is the eighth post and my last for 2012, so I will get straight into it.

To recap, along the way we have examined 5 f-laws and learned that:

Now, as a preamble to today’s mini-rant, I need to ‘fess up. I know this might come as a shock, but there was once a time when I was not the sweet, kind-hearted, gentle soul who pens these articles. In my younger days, I used to judge my self-worth on my level of technical knowledge. As a result, I knew my stuff, but was completely oblivious to how much of a pain in the ass I was to everyone except geeks who judged themselves similarly. Met anyone like that in IT? :-)

This brings me onto my next SharePoint governance f-law – one that highlights a common blind spot that many IT people have in their approach to SharePoint governance.

F-Law 6: Geeks are far less important than they think they are

All disciplines and organisational departments have a particular slant on reality that places them at the centre of that reality. If this were not true, then departments would not spend so much time bitching about other departments and I would have no Dialogue Mapping work. The IT department is no better or worse in this regard than any other department, except that the effects of its particular slant on reality can be more pronounced and far-reaching for everyone else. Why? Because the IT slant on reality sometimes looks like a version of Neo from the Matrix. Many, if not most, people in IT have a little Neo inside of them.

(image: Neo from The Matrix)

We all know Neo – an uber hero. He is wise, blessed with super powers, can manipulate your very reality and is a master of all domains of knowledge. Neo is also your last hope because if he goes down, we all go down. Therefore, everything Neo does – no matter how over the top or what the consequences are – is necessary to save the world from evil.

All of the little Neos in IT have a few things in common with bullet-stopping big Neo above. Firstly, little Neo has also been entrusted with ensuring that the environment is safe from the forces of evil. Secondly, little Neo can manipulate the reality that everybody else experiences. And finally, little Neo is often the last hope when things go bad. But that is where the similarities end, because big Neo has two massive advantages over little Neo. First, big Neo was a master of many domains of knowledge because he had the convenience of being able to learn any new subject by downloading it into his brain. Little Neo does not have this convenience, yet many little Neos still think they are all-knowing and wise. Second, big Neo was never mentally scarred by a really bad tequila bender…

Bad tequila bender? What the…

Never again…

Years ago when I was young and dumb, I was at a party drinking tequila using the lemon and salt method. My brother-in-law thought it would be hilarious to switch my tequila shots with vodka double shots. Unfortunately for me, I didn’t notice because the lemon and salt masked the taste. I downed a heap of vodkas and the net result was not pretty at all. Although I wasn’t quite as unfortunate as the guy in the picture below, I wasn’t that far off. As a result, to this day I cannot bring myself to drink tequila or vodka, and the smell of either makes me feel sick with painful memories best left suppressed.

(image: the unfortunate aftermath of a big night out)

I’m sure many readers can relate to a story like this. Most people have had a similar experience from an alcohol bender, eating a dodgy oyster or accidentally drinking tap water in a place like Bali. So take a moment to reflect on your absolute worst experience until you feel clammy and queasy. Feeling nauseous? Well guess what – there is something even worse…

Anyone who has ever worked in a system administrator role or any sort of infrastructure support will know the feeling of utter dread when the after-hours pager goes off, alerting you to some sort of problem with the IT infrastructure on which the business depends. Like many, I have lived through disaster recovery scenarios and let me tell you – they are not fun. Everyone turns to little Neo to save the day. It is high pressure and stressful trying to get things back on track, with your boss breathing down your neck, knowing that the entire company is severely degraded until you get things back online.

Now while that is bad enough, the absolute nightmare scenario for every little Neo in IT is having to pick up the pieces of something not of their doing in the first place. In other words, somehow a non-production system morphed into production and nobody bothered to tell Little Neo. In this situation, not only is there the pressure to get things back as quickly as possible, but Little Neo has no background knowledge of the system being recovered, has no documentation on what to do, never backed it up properly and yet the business expects it back pronto.

So what do you expect will happen in the aftermath of a situation like the one I described above? Like my aversion to tequila, little Neo will develop a pathological desire to avoid reliving that sort of pain and stress. It will be an all-consuming focus, overriding or trivialising other considerations. Governance for little Neo is all about avoiding risk and, just like big Neo, any action – no matter how over the top or what the consequences are – will be deemed necessary to ensure that risk is mitigated. Consequently, a common characteristic of lots of little Neos is the classic conservative IT department that defaults to “No” to pretty much any question that involves introducing something new. Accordingly, governance material will abound with service delivery aspects such as lovingly documented physical and logical architecture, performance testing regimes, defining universal site templates, defining security permissions/policies, allowed columns, content types and branding/styling standards.

Now all of this is nice and needs to be done. But there is a teeny problem. This quest to reduce risk has the opposite effect. It actually increases it because little Neo’s notion of governance is just one piece of the puzzle. It is the “dial tone” of SharePoint governance.

The thing about dial tone…

What is the first thing you hear when you pick up the phone to make a call? The answer of course is dial tone.

Years ago, Ruven Gotz asked me if I had ever picked up the phone, heard dial tone and thought “Ah, dial tone… Those engineers down at the phone company are doing a great job. I ought to bake them a  cake to thank them.” Of course, my answer was “No” and if anyone ever answered “Yes” then I suspect they have issues.

This highlights an oft-overlooked issue that afflicts all Neos. Being a hero is a thankless job. The reality is that the vast majority of the world could not care less that there is dial tone because it is expected to be there – a minimum condition of satisfaction that underpins everything else. In fact, the only time they notice dial tone is when it’s not there.

Yet, when you look at the vast majority of SharePoint governance material online, it could easily be described as “dial tone governance.” It places the majority of focus on the dial tone (service delivery) aspects of SharePoint and as a result, de-emphasises much more important factors of governance. Little Neo, unfortunately, has a governance bias that is skewed towards dial tone.

Keen eyed readers might be thinking that dial tone governance is more along the lines of what quality assurance is trying to do. I agree. Remember in part 2 of this series, I explained that the word ‘govern’ means to steer. We aim to steer the energy and resources available for the greatest benefit to all. Assurance, according to the ISO9000 family of standards for quality management, provides confidence in a product’s suitability for its intended purpose. It is a set of activities intended to ensure that requirements are satisfied in a systematic, reliable fashion. Dial tone governance is all about assurance, but the key word for me in the previous sentence is “intended purpose.”

Dial tone governance is silent on “intended purpose”, which provides opportunities for platitudes to fester and for governance to become a self-fulfilling prophecy.

and finally for 2012…

So, all of this leads to a really important question. If most people do not care about dial tone governance, then what do they care about?

As it happens, I’m in a reasonable position to be able to answer that question as I’ve had around 200 people around the world do it for me. This is because in my SharePoint Governance and Information Architecture Class, the first question I ask participants is “What is the hardest thing about SharePoint delivery?”

The question makes a lot of sense when you consider that the hardest bits of SharePoint usually translate to the highest risk areas for SharePoint. Accordingly, governance efforts should be focused in those areas. So in the next post in this series, I will take you through all the answers I have received to this question. This is made easier because I dialogue mapped the discussions, so I have built up a nice corpus of knowledge that we can go through and unpack the key issues. What is interesting about the answers is that no matter where I go, or whatever the version of SharePoint, the answers I get have remained extremely consistent over the years I have run the class.

Thanks for reading…

 

Paul Culmsee

P.S. I am on vacation for all of January 2013, so you will not be getting the next post till early February.



Sep 07 2012

Save the date in October: SharePoint Governance and Dialogue Mapping in the UK


Hi all

Just to let you know that in October, I will be in the UK to run a SharePoint Governance and Information Architecture class with Andrew Woodward. Additionally, I am very pleased to offer a Dialogue Mapping introductory course for the first time in the UK as well. Work has been extremely busy this year and this is my only UK/Europe trip in the next 9-12 months. In short, this is likely to be a once-off opportunity as I travel less and less these days.

Introductory Dialogue Mapping October 17-18, 2012

  • Venue: The Custard Factory, Birmingham, UK
  • Cost: £995


The introductory Dialogue Mapping class will arm you with a life skill that can be used in many different situations and one that has changed my career. If you have been following my “confessions of a (post) SharePoint architect” series, a lot of that content is based on my experiences of Dialogue Mapping many different projects in many different industries. Dialogue Mapping is a novel, powerful and inclusive method to elicit requirements, capture knowledge and develop shared understanding in complex projects, such as SharePoint or broader strategic planning. It was pioneered by the CogNexus Institute in California, and is used by NASA, the World Bank and the United Nations.

My book, “The Heretics Guide to Best Practices” is based on my Dialogue Mapping work and if you liked the book, then I know you will love the course!

What does a map look like? Check out my map of the AA1000 Stakeholder Engagement Standard or my synthesis on problems with intranet search below…

(images: IBIS maps of the AA1000 Stakeholder Engagement Standard and of problems with intranet search)

I should stress that this is not a SharePoint course. If you are an organisational development practitioner, facilitator, reformed project manager, all-round agitator or are simply interested in helping groups make sense of complex situations, then you would find this class to be highly valuable in your personal arsenal of tools and techniques. When performed live during a facilitated session, it is a highly efficient and engaging experience for participants.

Please note that seats are limited in this class – numbers are capped at 10.

  • Date: October 17-18, 2012
  • Venue: The Custard Factory, Birmingham, UK
  • Cost: £995



Aligning SharePoint Governance & Information Architecture to Business Goals October 15-16 2012

  • Venue: The Custard Factory, Birmingham, UK
  • Cost: £995
  • Limited seats available: 12


Previous Master Class Feedback:

  • "This course has been the most insightful two days of my SharePoint career"
  • "…Was the best targetted and jargon free course I’ve ever been on"
  • "Re-doing my draft SharePoint Governance. Moving away from blah, blah technical stuff"
  • "Easily one of the best courses I’ve been to and has left me wanting more!"
  • "Had a great couple of days at #SPIAUK loving IBIS"
  • "The content covered was about the things technically focussed peeps miss.."

Most people understand that deploying SharePoint is much more than getting it installed. Despite this, current SharePoint governance documentation abounds with service delivery aspects. Yet even if your system is rock solid, stable, well documented and governed through good process, there is absolutely no guarantee of success. Similarly, if Information Architecture for SharePoint were as easy as putting together lists, libraries and metadata the right way, then why doesn’t Microsoft publish the obvious best practices?

In fact, the secret to a successful SharePoint project is an area that the governance documentation barely touches.

This master class pinpoints the critical success factors for SharePoint governance and Information Architecture and rectifies this blind spot. It is based upon content developed by Paul Culmsee (Seven Sigma), which takes an ironic and subversive look at how SharePoint governance really works within organisations, while presenting a model and the tools necessary to get it right.

Drawing on inspiration from many diverse sources, disciplines and case studies, Paul Culmsee has distilled in this Master Class the “what” and “how” of governance down to a simple and accessible, yet rigorous and comprehensive, set of tools and methods that organisations large and small can utilise to achieve the level of commitment required to see SharePoint become successful.

Seven Sigma, together with 21apps, is bringing the acclaimed SharePoint Governance and Information Architecture Master Class back to the UK in October 2012.

  • Date: October 15-16, 2012
  • Venue: The Custard Factory, Birmingham, UK
  • Cost: £995
  • Limited seats available: 12


 

Thanks for reading

Paul Culmsee



Aug 16 2012

Demystifying SharePoint Performance Management Part 11 – Tales from the Microsoft labs


Hi all and welcome to the final article in my series on SharePoint performance management – for now anyway. Once SharePoint 2013 goes RTM, I might revisit this topic if it makes sense to, but some other blogging topics have caught my attention.

To recap the entire journey: the scene was set in part 1 with the distinction between lead and lag indicators. In part 2, we then examined Requests per Second (RPS) and looked at its strengths and weaknesses as a performance metric. From there, we spent part 3 looking at how to leverage RPS via the Log Parser utility and a little PowerShell goodness. Part 4 rounded off our examination of RPS by delving deeper into utilising Log Parser to eke out interesting RPS-related performance metrics. We also covered the very excellent SharePoint Flavored Weblog Reader utility, which saves a bunch of work and can give some terrific insights. Part 5 switched tack into the wonderful world of latency and, in particular, focused on disk latency. Part 6 then introduced the disk performance metrics of IOPS, MBPS and their relationship to latency. We also looked at typical SharePoint and SQL Server disk IO characteristics and then examined the pros and cons of RPS, IOPS, latency and MBPS, and how they all relate to each other. In part 7, and continuing into part 8, we introduced the performance monitor counters that give us insight into these metrics, as well as the SQLIO utility to stress test disk infrastructure. This set the scene for part 9, where we took a critical look at Microsoft’s own real-world findings to help us understand what suitable figures would be. The last post then introduced a couple of other excellent tools, namely Process Monitor and the Windows Performance Analysis Toolkit, that should be in your arsenal for SharePoint performance.

In this final article, we will tie up a few loose ends.

Insights from Microsoft labs testing…

In part 9 of this series, I examined Microsoft’s performance figures reported from their production SharePoint 2010 deployments. This information comes from the oft mentioned SharePoint 2010 Capacity Planning Guide. Microsoft are a large company and they have four different SharePoint farms for different collaborative scenarios. To recap, those scenarios were:

  1. Enterprise Intranet environment (also described as published intranet). In this scenario, employees view content like news, technical articles, employee profiles, documentation, and training resources. It is also the place where all search queries are performed for all of the other SharePoint environments within the company. Performance reported for this environment was 33580 unique users per day, with an average of 172 concurrent users and a peak concurrency of 376 users.
  2. Enterprise intranet collaboration environment (also described as intranet collaboration). This is where important team sites and publishing portals are housed. Sites created in this environment are used as communication portals, applications for business solutions, and general collaboration. Performance reported for this environment was double the first environment with 69702 unique users per day. Concurrency was more than double, with an average of 420 concurrent users and a peak concurrency of 1433 users.
  3. Departmental collaboration environment. In this scenario, employees use this environment to track projects, collaborate on documents, and share information within their department. Performance reported for this environment was a much lower figure of 9186 unique users per day (which makes sense given it is departmental stuff). Nevertheless, concurrency was similar to the enterprise intranet scenario with an average of 189 concurrent users and a peak concurrency of 322 users.
  4. Social collaboration environment. This is Microsoft’s My Sites scenario, connecting employees with one another and presenting personal information such as areas of expertise, past projects, and colleagues to the wider organization. This included personal sites and documents for collaboration. Performance reported for this environment was 69814 unique users per day, with an average of 639 concurrent users and a peak concurrency of 1186 users

Presented as a table, we have the following rankings:

  Scenario                            Unique Users   Avg Concurrent   Peak Concurrent
  Social Collaboration                       69814              639              1186
  Enterprise Intranet Collaboration          69702              420              1433
  Enterprise Intranet                        33580              172               376
  Departmental Collaboration                  9186              189               322
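One way to make these figures more comparable is to express concurrency as a percentage of unique daily users. Below is a little PowerShell sketch that does exactly that, using only the numbers from the table above (the figures come from the capacity planning guide; the script itself is just illustrative arithmetic).

    # Concurrency as a percentage of unique daily users, using the figures from the table above
    $scenarios = @(
        @{ Name = "Social Collaboration";              Unique = 69814; Avg = 639; Peak = 1186 },
        @{ Name = "Enterprise Intranet Collaboration"; Unique = 69702; Avg = 420; Peak = 1433 },
        @{ Name = "Enterprise Intranet";               Unique = 33580; Avg = 172; Peak = 376  },
        @{ Name = "Departmental Collaboration";        Unique = 9186;  Avg = 189; Peak = 322  }
    )
    $scenarios | ForEach-Object {
        [PSCustomObject]@{
            Scenario      = $_.Name
            "Avg Conc %"  = [math]::Round(($_.Avg  / $_.Unique) * 100, 2)
            "Peak Conc %" = [math]::Round(($_.Peak / $_.Unique) * 100, 2)
        }
    } | Format-Table -AutoSize

Run against these figures, peak concurrency works out to somewhere between roughly 1% and 3.5% of unique daily users – a handy sanity check the next time someone asks you to size a farm on the assumption that everybody will hit it at once.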

When you think about it, the performance information reported for these scenarios is lag indicator based. That is, they are real-world performance statistics from pre-existing deployments. Thus, while we can utilise the above figures for some insight into estimating the performance needs of our own SharePoint environments, they lack important detail. For example, in each scenario above, while the SharePoint farm topology was specified, we have no visibility into how these environments were scaled out to meet performance needs.

Some lead indicator perspective…

Luckily for us, Microsoft did more than simply report on the performance of the above four collaboration scenarios. For two of the scenarios, Microsoft created test labs and published performance results for different SharePoint farm topologies. This is really handy indeed, because it paints a much better lead indicator picture. We get to see what bottlenecks occurred as the load on the farm was increased. We also get insight into what Microsoft did to alleviate the bottlenecks and what sort of difference it made.

One lab was based off Microsoft’s own Departmental collaboration environment (the 3rd scenario above, referred to in the guide as a divisional portal) and is covered on pages 144-162 of the capacity planning guide. The other was based off the Enterprise Intranet Collaboration environment (the 2nd scenario above) and is the focus of attention on pages 174-197. Consult the guide for the full detail of the tests – this is just a quick synthesis.

Lab 1: Enterprise Intranet Collaboration Environment

In this lab, Microsoft took a subset of the data from their production environment and used different hardware. They acknowledge that the test results will be affected by this, but in my view it is not a show stopper if you take a lead indicator viewpoint. Microsoft tested web server scale-out by starting with a three server topology (one web front end server, one application server and one database server). They then increased the load on the farm until they reached a saturation point. Once this happened, they added an additional web server to see what would happen. This was repeated, scaling from one web server (1*1*1) to five web servers (5*1*1).

The transactional mix used for this testing was based on the breakdown of transactions from the live system. Little indication of read vs. write transactions is given in the case study, but on page 152 there is a detailed breakdown of SharePoint traffic by type. While I won’t detail everything here, regular old browser traffic was the most common, representing 36% of all test traffic. WebDAV came in second (WebDAV typically includes Office clients and Windows Explorer view), representing 28.12% of traffic, and Outlook sync traffic was third at 7.04%.

Below is a table showing the figures where things bottlenecked. Microsoft produce many graphs in their documentation so the figures below are an approximation based on my reading of them. It is also important to note that Microsoft did not perform tests while search was running, and compensated for search overhead by defining a max CPU limit for SQL Server of 80%.

                    1*1*1   2*1*1   3*1*1   4*1*1   5*1*1
  Max RPS             180     330     510     560     565
  Sustainable RPS     115     210     305     390     380
  Latency             0.3     0.2     0.35    0.2     0.2
  IOPS                460     710     910     920     840
  WFE CPU             96%     89%     89%     76%     58%
  SQL CPU             17%     33%     65%     78%     79%

For what it’s worth, the sustainable RPS figure is based on the servers not being stressed (all servers having less than 50% CPU). Looking at the above figures, several things are apparent:

  1. The environment scaled up to four Web servers before the bottleneck changed to be CPU usage on the database server
  2. Once database server CPU hit its limit, RPS on the web servers suffered. Note that the RPS gain from 4*1*1 to 5*1*1 is negligible because SQL CPU was saturated.
  3. The addition of the fourth web server had the least impact on scalability compared to the preceding three (RPS only increased from 510 to 560, which is much less than when the previous web servers were added). This suggests the SQL bottleneck hit somewhere between three and four web servers.
  4. The average latency was almost constant throughout the whole test, unaffected by the number of Web servers and throughput. This suggests that we never hit any disk IO bottlenecks.

Once Microsoft identified the point at which database server CPU was the bottleneck (4*1*1), they added an additional database server and then kept adding more web servers as they did previously. They split half the content databases onto one SQL Server and half onto the other. It is important to note that the underlying disk infrastructure was unchanged, meaning that total disk performance capability was kept constant even though there were now two database servers. This allowed Microsoft to isolate server capability from disk capability. Here is what happened:

                    4*1*1   4*1*2   6*1*2   8*1*2
  RPS                 560     660     890     930
  Latency             0.2     0.35    0.2     0.2
  IOPS                910    1100    1350    1330
  WFE CPU             76%     87%     78%     61%
  SQL CPU             78%     33%     52%     58%

Here is what we can glean from these figures.

  1. Adding a second database server did not provide much additional RPS (560 to 660). This is because CPU utilization on the Web servers was high. In effect, the bottleneck shifted back to the web front end servers.
  2. With two database servers and eight web servers (8*1*2), the bottleneck became the disk infrastructure. (Note the IOPS at 6*1*2 is no better than 8*1*2).

So what can we conclude? From the figures shown above, it appears that you could reasonably expect (remember we are talking lead indicators here) that bottlenecks are likely to occur in the following order:

  1. Web Server CPU
  2. Database Server CPU
  3. Disk IOPS

It would be a stretch to suggest when each of these would happen because there are far too many variables to consider. But let’s now examine the second lab case study to see if this pattern is consistent.

Lab 2: Divisional Portal Environment

In this lab, Microsoft took a different approach from the lab we just examined. This time they did not concern themselves with IOPS (“we did not consider disk I/O as a limiting factor. It is assumed that an infinite number of spindles are available”). The aim this time was to determine at what point a SQL Server CPU bottleneck was encountered. Based on what I noted from the first lab test above, unless your disk infrastructure is particularly crap, SQL Server CPU should become a bottleneck before IOPS. However, one thing in common with the last lab test was that Microsoft factored in the effects of an ongoing search crawl by assuming 80% SQL Server CPU as the bottleneck indicator.

Much more detail was provided on the transaction breakdown for this lab. Pages 181 and 182 list transactions by type and, unlike the first lab, whether they are read or write. While it is hard to compare directly to lab 1, it appears that more traffic is oriented around document collaboration than in the first lab.

The basic methodology was to start off with a minimal farm configuration of a combined web/application server and one database server. Through multiple iterations, the test ended with a configuration of three web servers, one application server and one database server. The table of results is below:

                    1*1     1*1*1   2*1*1   3*1*1
  RPS                 101     150     318     310
  Sustainable RPS      75      99     191     242
  Latency             0.81    0.85    0.6     0.8
  Users simulated     125     150     200     226
  WFE CPU             86%     36%     76%     42%
  APP CPU             N/A     41%     46%     44%
  SQL CPU             18%     32%     56%     75%

Here is what we can glean from these figures.

  1. Web Server CPU was the first bottleneck encountered.
  2. At a 3*1*1 configuration, SQL Server CPU became the bottleneck.  In lab 1 it was somewhere between the 3rd and 4th webserver.
  3. RPS, when CPU is taken into account, is fairly similar between each lab. For example, in the first lab the 2*1*1 scenario RPS was 330; in this lab it was 318, and both had comparable CPU usage. The single web server tests had differing results (101 vs 180), but if you adjust for the reported CPU usage, things even up.
  4. With each additional Web server, increase in RPS was almost linear. We can extrapolate that as long as SQL Server is not bottlenecked, you can add more Web servers and an additional increase in RPS is possible.
  5. Latencies are not affected much as we approached bottleneck on SQL Server. Once again, the disk subsystem was never stressed.
  6. The previous assertion that bottlenecks are likely to occur in the order of web server CPU, database server CPU and then the disk subsystem appears to hold true.

Now before we go any further, one important point that I have neglected to mention so far is that the figures above are extremely undesirable. Do you really want your web server and database server running at 85% CPU constantly? I think not. What you are seeing above are the upper limits, based on Microsoft’s testing. While this helps us understand maximum theoretical capacity, it does not make for a particularly scalable environment.

To account for the issue of reporting on maximum load, Microsoft defined what they termed a “green zone” of performance. This is a term to describe what “normal” load conditions look like (for example, less than 50% CPU), and they also provided RPS results for when the servers were in that zone. If you look closely at the tables above you will see those figures – I have labelled them “Sustainable RPS”.

In case you are wondering, the sustainable RPS for each of the scenarios comes in at roughly 60-75% of the peak RPS reported.

Some Microsoft math…

In the second lab, Microsoft offers some advice on how to translate their results into our own deployments. They suggest determining a users-to-RPS ratio and then utilising the green zone RPS figures to estimate server requirements. This is best illustrated via their own example from lab 2. They state the following:

  • A divisional portal at Microsoft which supports around 8000 employees collaborating heavily experiences an average RPS of 110.
  • That gives a Users to RPS ratio of ~72 (that is, 8000/110). That is: 72 users will amount to 1 RPS.
  • Using this ratio and assuming the sustainable RPS figures from lab 2 results, Microsoft created the following table (page 196) to suggest the number of users a typical deployment might support.

(Table from page 196 of the capacity planning guide: the number of users a typical deployment might support for various farm topologies.)
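If you want to apply the same arithmetic to your own numbers, here is a minimal PowerShell sketch of Microsoft’s approach. The 8000 users and 110 RPS are Microsoft’s own figures quoted above; the green zone RPS value is whatever you read off the lab tables for your candidate topology (I have plugged in 242, the 3*1*1 sustainable RPS from lab 2, purely as an illustration).

    # Microsoft's users-to-RPS math, as described above
    $totalUsers  = 8000     # employees supported by Microsoft's divisional portal (from above)
    $averageRPS  = 110      # average RPS observed on that portal (from above)
    $usersPerRPS = $totalUsers / $averageRPS    # roughly 72 users per 1 RPS

    # Estimate how many users a topology could support, given its "green zone" (sustainable) RPS.
    # 242 is the sustainable RPS reported for the 3*1*1 configuration in lab 2 - purely illustrative.
    $greenZoneRPS   = 242
    $estimatedUsers = [math]::Round($greenZoneRPS * $usersPerRPS)

    "Users per RPS   : {0:N1}" -f $usersPerRPS
    "Estimated users : {0:N0}" -f $estimatedUsers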

A basic performance planning methodology…

Okay… so I am done… I have no more topics that I want to cover (although I could go on forever on this stuff). Hopefully I have laid out enough conceptual scaffolding to allow you to read Microsoft’s large and complex SharePoint performance and capacity documentation with more clarity than before. If I were to sum up a few of the key points of this 11-part exploration into the weird and wonderful world of SharePoint performance management, it would be as follows:

  1. Part 1: Think of performance in terms of lead and lag indicators. You will have less of a brain fart when trying to make sense of everything.
  2. Part 2: Requests are often confused with transactions. A transaction (eg “save this document”) usually consists of multiple requests and the number of requests is not an indicator of performance. Look to RPS to help here…
  3. Part 3 and 4: The key to utilising RPS is to understand that as a counter on its own, it is not overly helpful. BUT it is the one metric that you probably have available in lots of detail, due to it being captured in web server logs over time. Use it to understand usage patterns of your sites and portals and determine peak usage and concurrent usage.
  4. Part 5: Latency (and disk latency in particular) is both unavoidable, yet one of the most common root causes of performance issues. Understanding it is critical.
  5. Part 6: Disk latency affects – and is affected by – IOPS, IO size and IO patterns. Focusing on one without the others is quite pointless. They all affect each other, so watch out when they are specified in isolation (i.e. “5000 IOPS”).
  6. Part 6, 7 and 8: Latency and IOPS are handy in the sense that they can be easily simulated and are therefore useful lead indicators. Test all SQL IO scenarios at 8KB and 64KB IO sizes and ensure they meet latency requirements.
  7. Part 9: Give your SAN dudes a specified IOPS, IO Size and latency target. Let them figure out the disk configuration that is needed to accommodate. If they can make your target then focus on other bottleneck areas.
  8. Part 10: Process Monitor and Windows Performance Analyser are brilliant tools for understanding disk IO patterns (among other things)
  9. Part 9 and 11: Don’t believe everything you read. Utilise Microsoft’s real world and lab results as a guide but validate expected behaviour by testing your own environment and look for gaps between what is expected and what you get.
  10. Part 11: In general, web server CPU will bottleneck first, followed by SQL Server CPU. If you followed the advice of points 6 and 7 above, then disk shouldn’t be a problem.

Now I promised at the very start of this series, that I would provide you with a lightweight methodology for estimating SharePoint performance requirements. So assuming you have read this entire series and understand the implications, here goes nothing…

If they can meet this target, skip to step 8.  :-)

If they cannot meet this, don’t worry because there are two benefits gained already. First, by finding that they cannot get near the above figures, they will do some optimisation like test different stipe sizes and check some other common disk performance hiccups. This means they now better understand the disk performance patterns and are thinking in terms of lead indicators. The second benefit is that you can avoid tedious, detailed discussions on what RAID level to go with up front.

So while all of this is happening, do some more recon…

  • 4. Examine Microsoft and HP’s testing results that I covered in part 9 and in this article. Pay particular attention to the concurrent users and RPS figures. Also note the IOPS results from Microsoft and HP testing. To remind you, no test ever came in over 1400 IOPS.
  • 5. Use logparser to examine your own logs to understand usage patterns (a sketch follows after this list). Not only should you eke out metrics like maximum concurrent users and RPS figures, but also examine peak times of the day, RPS growth rate over time, and what client applications or devices are being used to access your portal or intranet.
  • 6. Compare your peak and concurrent usage stats to Microsoft and HP’s findings. Are you half their size, double their size? This can give you some insight on a lower IOPS target to use. If you have 200 simultaneous users, then you can derive a target IOPS for your storage guys to meet that is more modest and in line with your own organisations size and make-up.
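To get you started on step 5, here is a minimal sketch of the kind of query I have in mind, using Log Parser 2.2 from PowerShell. Treat the Log Parser path, the log folder and the -i:IISW3C input switch (for standard IIS W3C logs) as assumptions to adjust for your own environment.

    # Hourly request counts from IIS logs - a quick way to see usage patterns and a rough RPS profile
    $logparser = "C:\Program Files (x86)\Log Parser 2.2\LogParser.exe"
    $query = 'SELECT QUANTIZE(TO_TIMESTAMP(date, time), 3600) AS Hour, COUNT(*) AS Requests ' +
             'INTO HourlyRequests.csv ' +
             'FROM C:\inetpub\logs\LogFiles\W3SVC1\u_ex*.log ' +
             'GROUP BY Hour ORDER BY Hour'
    & $logparser -i:IISW3C -o:CSV $query

Divide each hourly request count by 3600 for a crude average RPS for that hour, and eyeball the output over a few weeks to find your peaks and your growth trend.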

By now the storage guys will come back to you in shock because they cannot get near your 5000 IOPS requirement. Be warned though… they might ask you to sign a cheque to upgrade the storage subsystem to meet this requirement. It won’t be coming out of their budget for sure!

  • 7. Tell them to slowly reduce the IOPS until they hit the 8ms and 1ms latency targets and give them the revised target based on the calculation you made in step 6. If they still cannot make this, then sign the damn cheque!

At this point we have assumed that there is enough assurance that the disk infrastructure is sound. Now it’s all about CPU and memory.

  • 8. Determine a users to RPS ratio by dividing your total user base by average RPS (based on your findings from step 5).
  • 9. Look at Microsoft’s published table (page 196 of the capacity planning guide, referenced just above this conclusion). See what it suggests as the minimum topology needed for your deployment.
  • 10. Use that as a baseline and now start to consider redundancy, load balancing and all of that other fun stuff.

And there you have it! My 10-step dodgy performance analysis method. :-)

Conclusion and where to go next…

Right! Although I am done with this topic area, there are some next steps to consider.

Remember that this entire series is predicated on the notion that you are in the planning stage. Let’s say you have come up with a suggested topology, deployed the hardware and developed your SharePoint masterpiece. The job of ensuring performance meets expectations does not stop there. You should still consider load testing to confirm that the deployed topology meets expectations and validates the lead indicators. There is also a seemingly endless number of optimisations that can be done within SharePoint too, such as caching to reduce SQL Server load, or tuning web application or service application settings.

But for now, I hope that this series has met my stated goal of making this topic area that little bit more accessible, and thank you so much for taking the time to read it all.

 

Paul Culmsee

www.hereticsguidebooks.com

www.sevensigma.com.au



Aug 08 2012

Demystifying SharePoint Performance Management Part 10 – More tools of the trade…


Hi all and welcome to the tenth article in my series on demystifying SharePoint performance management. I do feel that we are getting toward the home stretch here. If you go way back to Part 1, I stated my intent to highlight some common misconceptions and traps for younger players in the area of SharePoint performance management, while demonstrating a better way to think about measuring SharePoint performance (i.e. lead and lag indicators). While doing so, we examined the common performance indicators of RPS, IOPS, MBPS, latency and the tools and approaches to measuring and using them.

I started the series praising some of Microsoft’s material, namely the “Planning guide for server farms and environments for Microsoft SharePoint Server 2010”, “Capacity Planning for Microsoft SharePoint Server 2010” and “Analysing Microsoft SharePoint Products and Technologies Usage” guides. But they are not perfect by any stretch, and in the last post I covered some of the inconsistencies and questionable information that exists in the capacity planning guide in particular. Not only are some of the quoted disk performance figures given without the critical context needed to measure them in a meaningful way, some of the figures themselves are highly questionable.

I therefore concluded Part 9 by advising readers not to believe everything presented and always verify espoused reality with actual reality via testing and measurement.

Along the journey we have undertaken, we have examined some of the tools that are available to perform such testing and measurement. So far, we have used Log Parser, the SharePoint Flavored Weblog Reader, Windows Performance Monitor, SQLIO and the odd bit of PowerShell thrown in for good measure. This article will round things out by showing you two additional tools for checking theoretical fiction against cold, hard reality. Both of these tools allow you to get a really good sense of IO patterns in particular (although they both have many other purposes). The first will be familiar to my more nerdy readers; the second, while highly powerful, is much less known to newbies and seasoned IT pros alike.

So without further ado, let’s meet our tools… Process Monitor and the Windows Performance Analysis Toolkit.

Process Monitor

Our first tool is Process Monitor, also commonly known as Procmon. Now this tool is quite well known, so I will not be particularly verbose in my examination of it. But for the three of you who have never heard of it, Process Monitor allows us to (among many other things) monitor accesses to the file system by processes running on a server. This allows us to get a really low level view of IO requests as they happen in real time. What is really nice about Process Monitor is its granularity. It allows you to set up sophisticated filtering that lets you see the wood for the trees. For example, one can create fairly elaborate filters that show just the details of a specific SQL database. Also handy is that all collected data can be saved to file for later examination.

When you start Process Monitor, you will see a screen something like the one below. It immediately starts monitoring file system, registry and process activity (there are around 140 monitorable operations in all, covering file system, registry, process, network and kernel events). The default columns that are displayed include:

  • the name of the process performing the operation
  • the operation itself
  • the path to the object the operation was performed on
  • (and most importantly), a detail column, that tells you the good stuff.

(screenshot: the main Process Monitor window)

The best way to learn Process Monitor is by example, so let’s use it to collect SQL Server IO patterns on SharePoint databases while performing a full crawl in SharePoint (and excluding writes to transaction logs). It will be interesting to see the range of IO request sizes during this time. To achieve this, we need to set up the filters for Procmon to give us just what we need…

First up,  choose “Filter…” from the Filter menu.

(screenshot: the Process Monitor Filter menu)

In the top left column, choose “Process Name” from the list of columns. Leave the condition field as “is” and click on the drop down next to it. It will enumerate the running processes, allowing you to scroll down and find sqlservr.exe.

(screenshots: creating the process name filter for sqlservr.exe)

Click OK and your newly minted filter will be added to the list (check out the green include filter below). Now we will only see operations performed by SQL Server in the Process Monitor display.

(screenshot: the filter list showing the new include filter)

Rather than give you a dose of screenshot hell, I will not individually show you how to add each filter as they are a similar process to what we just did to include only SQLSERVR.EXE. In all, we have to apply another 5 filters. The first two filter the operations down to reading and writing to the disk.

  • Filter on: Operation
  • Condition: Is
  • Filter applied: ReadFile
  • Filter type: Include

  • Filter on: Operation
  • Condition: Is
  • Filter applied: WriteFile
  • Filter type: Include

Now we need to specify the database(s) that we are interested in. On my test server, SharePoint databases  are on a disk array mounted as D:\ drive. So I add the following filter:

  • Filter on: Path
  • Condition: Contains
  • Filter applied: D:\DATA\MSSQL
  • Filter type: Include

Finally, we want to exclude writes to transaction logs. Since all transaction logs are written to files with an .LDF extension, we can use an exclusion rule:

  • Filter on: Path
  • Condition: Contains
  • Filter applied: LDF
  • Filter type: Exclude

Okay, so we have our filters set. Now widen the detail column that I mentioned earlier. If you have captured some entries, you should see the word “Length :” with a number next to it. This is reporting the size of the IO request in bytes. Divide by 1024 if you want to get to kilobytes (KB). Below you can see a range of 1.5KB to 32KB.

(screenshot: Process Monitor showing captured IO operations, with request sizes in the Detail column)

At this point you are all set. Go to SharePoint Central Administration and find the Search service application. Start a full crawl and fairly quickly you should see matching disk IO operations displayed in Process Monitor. When the crawl is finished, you can stop capturing and save the resulting capture to file. Process Monitor supports CSV format, which makes it easy to import into Excel as shown below (in the example below I created a formula for a column called “IO Size”).

(screenshot: the exported Process Monitor capture imported into Excel)

By the way, in my quick analysis of disk IO during a window of time within the full crawl, I captured 329 requests that were broken down as follows:

  • 142 IO requests (42% of total) were 8KB in size for a total of 1136KB
  • 48 IO requests (15% in total) were 16KB in size for a total of 768KB
  • 48 IO requests (15% in total) were >16KB to 32KB in size for a total of  1136KB
  • 49 IO requests (15% in total) were >32KB to 64KB in size for a total of 2552KB
  • 22 IO requests (7% in total) were >64KB to 128KB in size for a total of 2104KB
  • 20 IO requests (6% in total) were >128KB to 256KB in size for a total of 3904KB
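If Excel is not your thing, the same sort of breakdown can be produced with a few lines of PowerShell against the exported capture. This is only a sketch – the file name crawl-io.csv and the reliance on the default “Detail” column (which contains text like “Offset: 0, Length: 8,192, …”) are my assumptions.

    # Bucket the IO request sizes from a Process Monitor capture exported as CSV
    $buckets = @{}
    Import-Csv .\crawl-io.csv | ForEach-Object {
        if ($_.Detail -match 'Length\s*:\s*([\d,]+)') {
            $kb = [int]($matches[1] -replace ',', '') / 1024
            $bucket = switch ($kb) {
                { $_ -le 8 }   { '0-8KB';       break }
                { $_ -le 16 }  { '>8KB-16KB';   break }
                { $_ -le 32 }  { '>16KB-32KB';  break }
                { $_ -le 64 }  { '>32KB-64KB';  break }
                { $_ -le 128 } { '>64KB-128KB'; break }
                default        { '>128KB' }
            }
            $buckets[$bucket] = 1 + $buckets[$bucket]
        }
    }
    $buckets.GetEnumerator() | Sort-Object Key | Format-Table Key, Value -AutoSize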

Windows Performance Analyser (with a little help from Xperf123)

Allow me to introduce you to one of the best tools you never knew you needed. Windows Performance Analyser (WPA) is a newer addition to the armoury of tools for performance analysis and capacity planning. In short, it takes the idea of Windows Performance Monitor to a whole new level. WPA comes as part of a broader suite of tools collectively known as the Windows Performance Toolkit (WPT). Microsoft describes the toolkit as:

…designed for analysis of a wide range of performance problems including application start times, boot issues, deferred procedure calls and interrupt activity (DPCs and ISRs), system responsiveness issues, application resource usage, and interrupt storms.”

If most of those terms sounded scary to you, then it should be clear that WPA is a pretty serious tool and has utility for many things, going way beyond our narrow focus on disk performance. But never fear, BAs – I am not going to take a deep-dive approach to explaining this tool. Instead I am going to outline the quickest and simplest way to leverage WPA for examining disk IO patterns. In fact, you should be able to follow what I outline here on your own PC if a SharePoint server is not conveniently located nearby.

Now this gem of a tool is not available as a separate download. It actually comes installed as part of the Microsoft Windows SDK for Windows 7 and .NET Framework 4. Admins fearing bloat on their servers can rest easy though, as you can choose just to install the WPT components as shown below…

(screenshot: the Windows SDK installer with only the Windows Performance Toolkit components selected)

By default, the Windows Performance Toolkit will install its goodies into the “C:\Program Files\Microsoft Windows Performance Toolkit” folder. So go ahead and install it now, since it can be installed onto any version of Windows past Vista. (I am sure that none of you at all are reading this article on an Apple device, right? :-) )

Now assuming you have successfully installed WPT, I want you to head on over to CodePlex and download a little tool called Xperf123, and save it into the toolkit folder above. Xperf123 is a third-party tool that hides a lot of the complexity of WPA and is thus a useful starting point. The only thing to bear in mind is that Xperf123 is not part of WPA and is therefore not a necessity. If your inner tech geek wants to get to know the WPA commands better, then I highly recommend a comprehensive article written by Microsoft’s Robert Smith and published back in February 2012. The article is called “Analysing Storage Performance using the Windows Performance Analysis Toolkit” and it is an outstanding piece of work in this area.
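For the curious, the raw xperf equivalent of what Xperf123 automates looks something like the sketch below, run from the WPT installation folder. Treat it as indicative only – the choice of the DiagEasy kernel group and the output path are illustrative assumptions, and Robert Smith’s article covers the flags properly.

    # Start a kernel trace; the built-in DiagEasy group includes disk IO events among others
    .\xperf.exe -on DiagEasy

    # ...reproduce the workload here, e.g. kick off the full crawl...

    # Stop the trace and merge the captured events into an ETL file
    .\xperf.exe -d C:\Temp\CrawlTrace.etl

    # Open the trace in the Performance Analyser GUI
    .\xperfview.exe C:\Temp\CrawlTrace.etl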

So we are all set. Let’s run the same test as we did with Procmon earlier: I will start a trace on my test SharePoint server, run a full crawl and then look at the resulting IO patterns. Perform the following steps in sequence:

  1. Start Xperf123 from the WPT installation folder (run it as administrator).
  2. At the initial screen, click Next and then Next again at the screen displaying operating system details
  3. From the Select Trace Type dropdown, choose Disk I/O and press Next
  4. Ensure that “Enable Perfmon” and “Use Circular Logging” are ticked, and optionally choose “Specify Output Directory”. Press Next
  5. Leave “Stackwalk” unticked and choose Next

image     image

image  image

Alrighty then… we are all set! Click the Start Capture button to start collecting the good stuff! Xperf123 will run the actual WPA command line trace utility (called xperf.exe if you really want to know). Now go to SharePoint Central Administration and, just like we did with our Process Monitor test, start a full crawl. Wait till the crawl finishes and then, in Xperf123, click the Stop Capture button. A trace file will be saved in the WPT installation folder (or wherever you specified). The naming convention will be based on the server name and the date the trace was run.
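
Incidentally, if you would rather script the capture than click through the wizard, you can drive xperf.exe directly – that is essentially all Xperf123 is doing for you. The Python sketch below shows one way it might look; treat the kernel flags and file paths as illustrative assumptions rather than gospel, and see Robert Smith’s article (or xperf -help) for the full set of providers.

```python
import subprocess

# Sketch only: drive xperf.exe (from the Windows Performance Toolkit) directly,
# which is roughly what Xperf123 does behind the scenes. Run from an elevated
# prompt. The kernel flags and paths below are illustrative, not prescriptive.
XPERF = r"C:\Program Files\Microsoft Windows Performance Toolkit\xperf.exe"

# Start a kernel trace with process/image-load and disk IO providers enabled.
subprocess.run(
    [XPERF, "-on", "PROC_THREAD+LOADER+DISK_IO+DISK_IO_INIT",
     "-f", r"C:\PerfLogs\kernel.etl"],
    check=True)

input("Trace is running. Kick off the full crawl, then press Enter when it finishes...")

# Stop the trace and merge it into a single .etl file you can open in the viewer.
subprocess.run([XPERF, "-d", r"C:\PerfLogs\sp_crawl_trace.etl"], check=True)
```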

image  image

image

Okay, so capturing the trace was positively easy. How about analysing it visually?

Double click on the newly minted trace file and it will be loaded into the Performance Analyser analysis tool (this tool is also available from the Start menu). When the tool loads and processes the trace file, you will see CPU and disk IO counts reported visually. The CPU is the line graph and the IO counts are represented in a bar graph. Unlike Windows Performance Monitor, which we covered in Part 7, this tool has a much better ability to drill down into details.

If you look closely below, there are two “flyout” arrows: one on the left side in the middle of the screen that applies to all graphs, and another at the top-right of each graph. If you click them, you are able to apply filters to what information is displayed. For example, if you click the “IO Counts” flyout, you can filter which type of IO you want to see. Looking at the screenshot below, you can see that the majority of what was captured were disk writes (the blue bars below).

image  image

Okay, so let’s take a closer look at what is going on disk-wise. Right click somewhere on the Disk IO bar graph and choose “Detail Graph” from the menu.

image

Now we have a very different view. On the left we can see which disk we are looking at and on the right we can see detailed performance stats for that disk. It may not be clear from the screenshot, but the disk IO reported below is broken down by process. This detailed view also has flyouts and dropdowns that allow you to filter what you see. There is an upper-left dropdown menu under the section called “Physical Disk”, which allows you to change which disk you are interested in. On the upper right, there is a flyout labelled “Process Name”. Clicking this allows you to filter the display to only view a subset of the processes that were running at the time the trace was captured.

image

Now in my case, I only want to see the SQL database activity, so I will make use of the aforementioned filtering capability. Below I show where I selected the disk where the database files reside, and on the right I deselected all processes apart from SQLSERVR.EXE. Neat huh? Now we are looking at a graph of every individual IO operation performed during the time displayed, and you can even hover over each dot to get more detail about that particular operation.

image

You can also zoom in with great granularity. Simply select a time period from the display by dragging the mouse pointer over the graph area. Right click the now selected time period and choose “Zoom to Selection”. Cool eh? If your mouse has a wheel button, you can also zoom in and out by pressing the Ctrl key and rolling the mouse wheel.

image

Now, even for the most wussy non-technical BA reading this, surely your inner nerd is liking what you see. But why stop here? After all, Process Monitor gave us lots more loving detail and had the ability to utilise sophisticated filtering. So how does WPA stack up?

To answer this question, try these steps: right click on the detail graph and this time choose “Summary Table”. This allows us to view even more detailed IO data.

image

image

Voila! We now have a list of every IO transaction performed during the sample period. Each line in the summary table represents a single I/O operation. The columns are movable and sortable as well. On that note, some of the more interesting ones for our purposes include (thanks to Robert Smith for the explanation of these):

  • IO Type: Read, Write, or Flush
  • Complete Time: Time of I/O completion in milliseconds, relative to start and stop of the current trace.
  • IO Time: The amount of time in milliseconds the I/O took to complete
  • Disk Service Time: The inferred amount of time (in microseconds) the IO operation has spent on the device (this one has caveats, check Robert Smith’s post for detail).
  • QD/I: Queue Depth of the disk, irrespective of partitions, at the time this I/O request was initialized
  • IO Size: Size of this I/O, in bytes.
  • Process Name: The name of the process that initiated this I/O.
  • Path: Path and file name, if known, that is the target of this I/O (in plain English, this essentially means the file name).

I have a lot of IO requests in this summary view, so let’s see how this baby can filter. First up, let’s look only at IO that was initiated by SQL Server. Right click on the “Process Name” column and choose “Filter To” –> “Search on Column…” In the resulting window, enter “SQLSERVR.EXE” in the “Find what:” textbox. Double check that the column name is set to “Process name” in the dropdown and click Filter.

image  image

You should now see only SQL IO traffic. So let’s drill down further still. This time I want to exclude IO transactions that are transaction log related. To do this, right click on the “Path Name” column and choose “Filter To” –> “Search on Column…” In the resulting window, enter “MDF” in the “Find what:” textbox. Double check that the column name is set to “Path name” in the dropdown and click Filter.

image image

Can you guess the effect? Only SQL Server data files will be displayed, since they typically have a file extension of .MDF (the transaction logs use .LDF and are therefore filtered out).

In the screenshot below, I have then used the column sorting capability to look at the IO sizes. Neat huh?

image
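
As an aside, once you have the summary table, you can do the same filter-and-sort exercise outside the GUI if you prefer. The sketch below assumes you have exported the summary table to a CSV file (summary.csv is a made-up name, and the column headings mirror the columns listed earlier – adjust both to match whatever your export actually produces).

```python
import csv

# Sketch only: repeat the "filter to SQLSERVR.EXE, then to .MDF files, then
# sort by IO size" exercise over a CSV export of the summary table.
# "summary.csv" and the column headings are assumptions - adjust them to
# match whatever your export actually contains.
with open("summary.csv", newline="") as f:
    rows = list(csv.DictReader(f))

sql_io = [r for r in rows if r["Process Name"].upper().startswith("SQLSERVR")]
mdf_io = [r for r in sql_io if r["Path"].upper().endswith(".MDF")]

# Sort largest IO first, just like clicking the IO Size column header.
mdf_io.sort(key=lambda r: int(r["IO Size"].replace(",", "")), reverse=True)

print(f"{len(mdf_io)} IOs against .MDF files out of {len(sql_io)} SQL Server IOs")
for r in mdf_io[:10]:
    print(r["IO Type"], r["IO Size"], "bytes,", r["Path"])
```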

Don’t forget Performance Monitor…

Just before we are done with the Windows Performance Toolkit, cast your mind back to the start of this walkthrough when we used Xperf123 to generate this trace. If you check back, you might recall there was a tickbox in the Xperf123 wizard called “Enable Perfmon”. Well, it turns out that Xperf123 had one final perk: while the WPA trace was being captured, a Perfmon trace of broader system performance was also being recorded. These logs are located in the C:\PerfLogs\ directory and are saved in the native Windows Performance Monitor format. So just double click the file and watch the love…

image

How’s that for a handy added bonus? It is also worth mentioning that the captured Perfmon trace includes a significant number of performance counters in the Memory, PhysicalDisk, Processor and System categories.

Conclusion and coming next…

Well! That was a long post, but that was more because of the number of screenshots than anything else.

Both Windows Performance Monitor and Windows Performance Analyser are very useful tools for developing a better understanding of disk IO patterns. While Procmon has more sophisticated filtering capabilities, WPA trumps Procmon in terms of reduced overhead (apparently 20,000 events per second uses less than 2% CPU on a 2.0 GHz processor). WPA also has the ability to visualise and drill down into the data better than Procmon can.

Nevertheless, both tools have far more utility beyond the basic scenarios outlined in this series and are definitely worth investigating more.

In the next and I suspect final post, I will round off this examination of performance by making a few more general SharePoint performance recommendations and outlining a lightweight methodology that you can use for your own assessments.

Until then, thanks for reading…

Paul Culmsee

www.hereticsguidebooks.com



Send to Kindle


Jul 30 2012

Demystifying SharePoint Performance Management Part 9 – Don’t believe everything you R/W

Send to Kindle

Hi and welcome to Part 9 (bloody hell… nine!) of my series on trying to demystify SharePoint performance management a bit. If by any chance you have been asked to provide some sizing information for your organisation and you are finding the resources online a bit overwhelming, this series is for you. If you have been a part of our varied journey so far, you will know the last few posts have been all about disk IO performance in the form of latency, IOPS and MBPS. In the last two articles, we learned about the different IO patterns that SQL Server is likely to utilise, as well as how to use the jackhammer known as SQLIO to simulate those IO patterns on unsuspecting disk infrastructure.

Now just to set the scene for this post (and conveniently perform some product placement), I recently published a book called “The Heretics Guide to Best Practices”. Now being the author and all, I am going to suggest you buy it because it is a completely riveting read! :-).

Now apart from blatant product placement, the real reason I mention it is because one of the chapters is called “Myths, Memes and Methodologies”. In it, we examine why some ideas gain legitimacy even though they are based on often completely dodgy foundations. I mention this here because, in terms of SQL disk IO sizing, something similar has happened with Microsoft’s published material on the topic. So the focus of this article is to finish off our discussion on understanding disk IO patterns, while lifting the lid on some of the inconsistencies in the material that end up being repeated by SharePoint consultants as gospel to their unsuspecting disciples.

Now harking way back to part 1 and the notion of lead vs. lag indicators, our use of SQLIO thus far has essentially been as a lead indicator. While SQLIO puts a real load on disk infrastructure and faithfully reports the resulting IOPS, latency and MBPS, the reality is that it can never truly capture the nuances of a production SharePoint farm doing its thing. But in terms of a lead indicator that is okay. After all, a lead indicator by definition cannot guarantee an outcome. It can merely suggest that an outcome should be able to be met.

So while we are thinking about the lead indicator world view, some of you might have noticed that I have not yet made any suggestions as to the minimum conditions of satisfaction for disk infrastructure used to underpin SharePoint. This has been deliberate until now, because I felt it was critical to first understand the relationship between the size of a disk IO operation and its effect on IOPS, latency and MBPS. To that end, hopefully I have instilled a reflex in you where – if you are given an arbitrary latency, IOPS or MBPS figure that you have to meet – you immediately ask questions like “What sort of IO patterns?”, “How large is the IO request typically going to be?” or “Is the IO random or sequential?”

When whitepapers mislead…

Now we are about to get into one area where Microsoft’s published documentation is quite weak. Remember the 367 page “Capacity Planning for Microsoft SharePoint Server 2010” whitepaper? Starting at page 326, there is a section with the promising title of “Estimate Core Storage and IOPS needs” (this topic is also available separately as a TechNet article). The problem is that despite the title, very little IOPS guidance is actually given. Instead, the content in the section overwhelmingly speaks about estimating storage requirements. In fact, the best you get is one explicit mention of IOPS in relation to the SharePoint Search service application, which states the following:

The IOPS requirements for Search are significant.

  • For the Crawl database, search requires from 3,500 to 7,000 IOPS.
  • For the Property database, search requires 2,000 IOPS.

Note: For the purpose of the rest of this article, let’s add the above figures together and simply say between 5,500 and 9,000 IOPS for search.

Do you see the problem here? This is simply an arbitrary IOPS figure with no guidance as to the IO patterns underpinning it. What about latency or the IO request size that you need to assume? Unfortunately, no guidance is given for these questions which makes this quoted figure not overly helpful. Plus, as you will soon see below, Microsoft seemingly contradict themselves elsewhere in the same whitepaper…

So what are good numbers to use?

In the absence of any hard data, the best way to deal with storage requirements is to think in terms of lead indicators. Indicators, from a lead point of view, can be framed as targets – something to aim for. Targets can then be broken down into different categories ranging from “cover your arse” to “above and beyond”:

  • Aspired target: The “this would be bloody fantastic if we could get there” target.
  • Agreed target: The “this is what we are going to deliver no matter what” target.
  • Minimum Condition of Satisfaction (MCOS) target: The “If we don’t achieve this we may as well pack up and go home” target.

So given these sorts of targets, what should the disk IO performance targets for SharePoint be? To work this out, we can utilise information already out there. Well… that is, we could if the information out there wasn’t so disparate and disconnected. So unfortunately, it takes some digging before you can find what you need.

Our first port of call in this regard is indeed Microsoft and the very same capacity planning and configuration guide that I criticised earlier for poorly dealing with IOPS. Hidden in the bowels of that document, the following statement is made on page 334 (emphasis mine):

Any storage architecture must support your availability needs and perform adequately in IOPS and latency. To be supported, the system must consistently return the first byte of data within 20 milliseconds.

So the way I look at it, a 20ms latency should be our MCOS target (see the explanation above for MCOS). If we consistently do worse than this, then we do not have a lot of assurance about the scalability of the disk IO subsystem being used for SharePoint. But like the arbitrary IOPS figure quoted in the previous section, I wonder if readers have spotted the problem with specifying this latency figure alone?

In both cases, don’t forget the almost symbiotic relationship between IO size, IOPS and latency. If we assumed that all IO operations were small (for example, SQL’s page size of 8KB) then we could likely stay way under the 20ms limit with a more modest disk infrastructure. But to sustain the same latency with a larger IO size would require a faster disk subsystem. Why? Well, as we discussed in part 6, if the size of the IO writes is larger, such as 64KB, then latency will go up because servicing larger requests takes longer than smaller ones. Therefore, if we were to assume a larger IO size, we would need more/faster disks to be able to meet the same 20ms latency KPI.

So what disk IO size should we assume to give context to a latency figure? Some insight can be found back in part 6, when we examined SQL IO characteristics and established that it will likely be much more varied than SQL’s base IO unit of 8KB pages. My suggestion therefore is to test 8KB, but also ensure that 64KB can meet the latency target. This is because 64KB represents a reasonable average within the 8KB to 256KB range that most of SQL Server’s IO operations fall into. Thus, if a SQLIO test using random read/writes at 64K consistently indicates more than 20ms latency, then you should probably ask your storage people to take another look at it.
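
To make the point concrete, here is a tiny back-of-the-envelope sketch (the IOPS values in it are arbitrary examples, not recommendations) showing how much raw throughput the same IOPS figure implies at different IO sizes. This is exactly why a latency or IOPS target quoted without an IO size is ambiguous.

```python
# The same IOPS figure implies very different throughput (and therefore a very
# different disk subsystem) depending on the IO size you assume.
# The IOPS figures below are just examples.
def throughput_mb_per_sec(iops: int, io_size_kb: int) -> float:
    """Throughput implied by a given IOPS figure at a given IO request size."""
    return iops * io_size_kb / 1024.0

for iops in (100, 2000, 5000):
    for io_size_kb in (8, 64, 256):
        print(f"{iops:5d} IOPS at {io_size_kb:3d}KB -> "
              f"{throughput_mb_per_sec(iops, io_size_kb):7.1f} MB/sec")
```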

By the way, if you really want to give your storage guys a challenge, keep jacking up the IO size!

What about aspired latency targets?

So if you are cool with the notion that the minimum condition of satisfaction for a random IO test using 64K size should be less than 20ms latency, what about aiming higher with agreed or aspirational targets?

Luckily for all of us, we can once again stand on the shoulders of giants. In this case, Bob Duffy indirectly answers this question by providing what he considers to be the indicators for optimal SQL Server performance in general. In an excellent article with the rather appropriate title of “How to Specify SQL Storage Requirements to your SAN Dude”, Bob makes the following recommendations:

  • SQL data files must have a response time averaging about 8ms and a maximum response time of around 20ms, using 64k IOs that are random in nature
  • SQL log files must have a write response time averaging 1-5ms, using 64k IOs that are sequential in nature

The nice thing about specifying a target or benchmark like this is that you are able to sidestep discussions on RAID levels, stripe sizes and many other things that SAN nerds find interesting. We keep things focused on the lead indicators and in effect state “If you can meet these figures, configure it any way you like.” This gives the SAN guys the freedom to do their job, while giving you an indicator that provides confidence in the disk infrastructure. So if we were to distil the figures above into lead indicator targets for storage gurus, it might look something like the list below (with a small sketch afterwards showing how you might turn them into a pass/fail check):

  • MCOS target: Less than 20ms latency for random IO requests of 64KB
  • Agreed target: Average 8ms latency for random IO requests of 64KB with no more than 20ms max latency. Less than 5ms latency for sequential log IO
  • Aspired target: No more than 8ms latency for random SQL IO requests of 64KB, and an average of 1ms latency for sequential log IO with the max never going above 5ms
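
For what it’s worth, here is a small sketch of how you might turn those three targets into a pass/fail check for your own SQLIO runs. The latency figures passed in at the bottom are placeholders – substitute the average and maximum latencies your own 64KB random and sequential tests report.

```python
# Sketch only: grade your own SQLIO results (64KB random for data files, 64KB
# sequential for logs) against the three targets above. The numbers you feed
# in would come from your SQLIO output; the values below are placeholders.
def grade_random_64k(avg_ms: float, max_ms: float) -> str:
    """Grade a 64KB random read/write test for SQL data file volumes."""
    if max_ms <= 8:
        return "Aspired target met"
    if avg_ms <= 8 and max_ms <= 20:
        return "Agreed target met"
    if avg_ms < 20:
        return "MCOS met - but only just"
    return "MCOS missed - time to talk to the storage team"

def grade_sequential_log(avg_ms: float, max_ms: float) -> str:
    """Grade a 64KB sequential write test for transaction log volumes."""
    if avg_ms <= 1 and max_ms <= 5:
        return "Aspired target met"
    if avg_ms <= 5:
        return "Agreed target met"
    return "Target missed"

print(grade_random_64k(avg_ms=6.5, max_ms=18))    # placeholder figures
print(grade_sequential_log(avg_ms=2.2, max_ms=7)) # placeholder figures
```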

Now in the ProData article, Bob made a slightly tongue-in-cheek point that sums up the above thinking really well, as well as giving insight into a critical aspect we have not considered so far…

Nowadays most SQL consultants try and not talk about RAID types and types of disk, it can be best to leave that up to the storage guys. If the storage team can meet my requirement for 5,000 random 64k read/write IOPs at 8ms latency by using 50 old SATA drives at 5,400 rpm in RAID 5 then knock yourself out – I’m happy. Well maybe I’m happy till we have that chat about Service Level requirements during a disk degrade event but that’s a different story…

If you look closely at Bob’s quote, you will see that he has also specified the last critical variable in the mix. Bob’s mention of “5000 random 64k read/write IOPS” is in reference to another point he makes. Without an IOPS figure to work from, the targets we have come up with are effectively meaningless. Quoting Bob:

The main thing to specify apart form your latency requirement is the throughput (IOPs). It is no good meeting the 8ms target for 100 IOPs and then finding your workloads needs 5,000 IOPs. You wont be able to meet the 8ms target!!

Consider it this way… a SharePoint site that services 100,000 users will process a lot more IO requests than a site that services 10 users. With the latter, it is quite likely that the latency targets we have been talking about (even the aspirational ones) would be pretty easy to meet with a single disk. (To hark back to our shopping centre metaphor, one checkout operator is all that is needed at a corner store, whereas many are needed at the supermarket). This is presumably why Bob has used a figure like 5000 IOPS for his post. It is probably a figure that conveniently represents some fairly heavy disk usage. But it does beg two questions:

  • How many IOPS should we use to simulate SharePoint load?
  • In the absence of anything else, perhaps 5000 IOPS is a good figure to go with?

Don’t believe all you read…

Now if you go back and read the start of this post, you will recall I mentioned that Microsoft stated IOPS figures for the SharePoint search application databases ranging between 5,500 and 9,000. That would indicate that Bob’s base figure of 5000 is a bit low, especially given that SharePoint has many other components beyond search that have not been taken into account. So to put Bob’s 5000 IOPS figure in perspective, let’s re-examine Microsoft’s trusty capacity planning whitepaper. One of the great things about this document is that Microsoft detailed the performance stats of a typical day in the life of their internal SharePoint environments. Since Microsoft are so large, they have different SharePoint farms for different collaborative scenarios. The scenarios they covered were:

  1. Enterprise Intranet environment (also described as published intranet). In this scenario, employees view content like news, technical articles, employee profiles, documentation, and training resources. It is also the place where all search queries are performed for all of the other SharePoint environments within the company.
  2. Enterprise intranet collaboration environment (also described as intranet collaboration). This scenario is where important team sites and publishing portals are housed. They are typically used for enterprise collaboration, organizations, teams, and projects. Sites created in this environment are used as communication portals, applications for business solutions, and general collaboration. No custom code runs in these sites.
  3. Departmental Collaboration environment. In this scenario, employees use this environment to track projects, collaborate on documents, and share information within their department.
  4. Social Collaboration Environment. This is the My Sites scenario. These connect employees with one another and the information that they need. Employees use this environment to present personal information such as areas of expertise, past projects, and colleagues to the wider organization. The environment also hosts personal sites and documents for viewing, editing, and collaboration.

Now reading about these scenarios is highly interesting and Microsoft provides some nice nuggets of information that we will use in a future post. But for now I will stick purely to a disk IOPS perspective. To that end, below are a few fun-filled facts about the number of users in each of the four scenarios:

  1. Enterprise Intranet environment:  33580 unique users per day, with an average of 172 concurrent and a peak concurrency of 376 users.
  2. Enterprise intranet collaboration environment: 69702 unique users per day, with an average of 420 concurrent users and a peak concurrency of 1433 users
  3. Departmental Collaboration environment. 9186 unique users per day, with an average of 189 concurrent users and a peak concurrency of 322 users
  4. Social Collaboration Environment. 69814 unique users per day, with an average of 639 concurrent users and a peak concurrency of 1186 users

So now you have a sense of the size of these scenarios and, as an added bonus, gotten a glimpse into the difference that usage patterns can make. For example, social collaboration and enterprise intranet collaboration have a similar number of unique users, but social has higher average concurrency and a lower peak. But what about IOPS?

In the document, IOPS is split into reads per second and writes per second, so I added them together to estimate total IOPS. The results are rather surprising…

Metric                Social Collaboration   Departmental Collaboration   Published intranet   Intranet Collaboration
Unique visitors       69814                  9186                         33580                69702
Average concurrent    639                    189                          172                  420
Max concurrent        1186                   322                          376                  1433
IOPS                  941                    74                           409.66               409.66

Now while it might be tempting to ponder why social collaboration has over double the IOPS, yet half the concurrency, of enterprise intranet collaboration, we are not going to worry about that here. Besides, we actually covered some of it already when we used logparser to get insights into usage patterns. What I will instead draw your attention to is the fact that none of the IOPS scenarios come anywhere near the 5000 IOPS figure cited by ProData or Microsoft’s 5500-9000 IOPS cited for search (in the very same capacity planning document, I might add!)

So something is amiss. If an organisation the size of Microsoft can have almost 70,000 unique users per day, with a peak concurrency of 1433 users, and total only 410 IOPS, then where the hell did the 5500-9000 IOPS figure for search alone come from? Even if you take the scenario with the highest IOPS (the social collaboration scenario with 941 IOPS), that’s still less than one fifth of the 5500 IOPS which was at the low end of the search IOPS figure.

Now I am also suspicious that two different case studies have the exact same IOPS figure. If you compare the “published intranet” scenario with the “intranet collaboration” scenario, one has half the visitors, yet both have precisely the same IOPS (right down to the decimal places). That seems highly unlikely to me and I suspect a mistake has been made. Given that intranet collaboration has the highest max concurrency figure, I would have expected its IOPS to be higher than it is. Hmmm…

What can we take away from this? For one, the capacity planning document could seriously do with a rewrite in this area. Secondly, I don’t have a lot of faith in the IOPS figures quoted (although I have more confidence in the case studies than in the arbitrary figures specified for search).

So if we put aside the doubt created by the issues with the capacity planning guide, there is one really interesting fact that remains… none of the reported IOPS figures came anywhere near 5000 IOPS.

Insights from HP…

It turns out that Hewlett Packard also did some load testing of SharePoint 2010 (among other things) and published a whitepaper called the “HP performance and configuration guide for Microsoft SharePoint 2010”. In this guide, they detail the results of a scenario they tested based on what they termed an “Enterprise Workload”. The guide covers the definition of this enterprise workload in loving detail, but the gist of it is that it covers the following areas:

  • Document Center (30% of operations) – check out, download, upload and check in documents
  • Team Sites (20% of operations) – work with calendars, discussions and documents
  • Portal Sites (20% of operations) – work with events, announcements and surveys
  • My Sites (10% of operations) – work with documents in the personal documents library
  • Search (20% of operations) – submit searches with random words or phrases

HP then simulated 500 concurrent users performing the actions above. In Table 13 of the report (page 28 of their document, and reproduced below), HP outlines the performance and even breaks down the IO characteristics of each SharePoint database (which is really handy indeed). Adding up the last column of transfers/sec (which is essentially IOPS) we get a result of 1347.33 IOPS.

Thus we are still considerably under the 5000 IOPS that Bob Duffy suggests.

Conclusion…

Right! Remember our discussion above on MCOS, agreed and aspired targets? For an aspirational target, I think that we can reasonably use 5000 IOPS as a starting point for an enterprise organisation of Microsoft’s size. If we stick with 5000 IOPS, then my suggestion for an aspirational latency target would be:

  1. No more than 8ms latency for random SQL IO requests of 64KB
  2. An average of 1ms (and no more than 5ms max) latency for sequential log IO of 64KB

I think these figures are a pretty good test of a disk subsystem, and that Bob at ProData is therefore pretty close to the mark. Of course, you can use these figures to make your own judgement and adjust accordingly. Provided that you think of them as lead indicators that provide a level of confidence in your disk infrastructure, you now have the tools and knowhow to run the tests too.

So if there was a moral of the story to this post, it would be to not believe everything you read and to always verify espoused reality with actual reality via testing. On that note, the next post will finish off our examination of disk performance by going over two additional tools that I think are particularly good for testing assumptions. After that, we will be revisiting Microsoft’s case studies, as well as some findings, insights and recommendations from some additional lab scenarios that Microsoft conducted.

Thanks for reading

Paul Culmsee

www.sevensigma.com.au



Send to Kindle


