Confessions of a (post) SharePoint architect: Do not penalise people for learning

This entry is part 3 of 10 in the series confessions

Hi all and welcome to another small piece of my SharePoint architect manifesto. In the previous post I introduced you to the notion of f-laws, which were first coined in a book co-authored by Russell Ackoff. In the process of developing a SharePoint governance and information architecture class, I was inspired to use the idea of f-laws because they appealed to my contrarian sense of humour and contained some interesting nuggets of wisdom. I ended up coming up with a heap of f-laws for SharePoint, and plan to write an article covering each one.

In the last post, we learnt that f-Law 1 was all about the contention that the more you try and define what governance is, the less anyone will actually understand it. If you have not read the first SharePoint governance f-law then I suggest you do so, because these articles build on the foundation set by the previous ones. On that note, the focus of today’s f-law extends one that Ackoff himself came up with. All I am doing is putting a SharePoint bent on it…

F-Law 2: There is no point in asking users, who don’t know what they want, to say what they want.

This f-law comes with an additional corollary: There is even less point in thinking that you already know what they want! (IT departments – I am looking at you here!)

The definitive way to explain this f-law is to leverage the work of one of my mentors – Jeff Conklin. In 2007 I read a paper of his that literally changed my career. It was titled “Wicked problems and social complexity” and, despite the many papers I have read since, to this day it remains one of the best introductions to complex problem solving you could ever read.

In this paper, Jeff talks about the fact that many prevailing methodologies suggest that the best way to work on a problem is to follow an orderly and linear “top-down” process, working from the problem to the solution. You begin by understanding the problem, including gathering and analysing “requirements” from customers or users. Once you have the problem specified and the requirements analysed, you are ready to formulate a solution and eventually implement that solution. This is illustrated by the red line in the figure below.

[Figure: the linear “red line” approach – an orderly, staged progression from problem to solution]

Now many project managers, cheque signers and just about every program management office I have ever worked with try to operate this way because it promises order, certainty and control. This is understandable when one’s performance is being judged on getting stuff done to an agreed time and cost. It is also understandable if you are a manager and will get your ass kicked if you blow your budget. There is only one teeny issue. For some scenarios, it simply does not work.

In Conklin’s paper, he detailed a 1980s case study at the Microelectronics and Computer Technology Corporation (MCC) that looked into how people solve problems. A number of designers participated in an experiment where they were given a one-page problem statement – not too different from many SharePoint business cases I have seen. Participants in the experiment had to design an elevator control system for an office building. Despite participants being experienced and expert integrated-circuit designers, none had ever worked on elevator systems before. Each participant was asked to think out loud while they worked on the problem. The sessions were videotaped and analysed in detail.

Below is what really happened. Check out the green line…

[Figure: the jagged “green line” trace of how the designers actually moved between problem and solution]

Clearly, the subjects in the elevator experiment did not follow a linear approach. They would start by trying to understand the problem, but they would immediately jump into formulating potential solutions. Then they would jump back up to refining their understanding of the problem. Rather than being orderly and staged like the red line, the line plotting the course of their thinking looked more like a seismograph for an earthquake. Now if you are looking at the green line and thinking “my god I better put a stop to that sort of shenanigans,” consider what Conklin had to say about it in his paper:

These designers are not being irrational. They are not poorly trained or inexperienced. Their thought process was something like: “Let’s see, idle elevators should return to the first floor, but then, you only need one elevator on the first floor, so the others could move to an even distribution among the floors. But the elevators need to be vacuumed regularly. I suppose we could add a switch that brought idle elevators down to the first floor. But then what happens in an emergency?” In other words, what is driving the flow of thought is some marvellous internal drive to make the most headway possible, regardless of where the headway happens, by making opportunity-driven leaps in the focus of attention. It is precisely because these expert designers are being creative and because they are learning rapidly that the trace of their thinking pattern is full of unpredictable leaps.

When I am speaking at conferences I like to mess with project managers in particular (who doesn’t eh?), because they are an easy target and already an insecure lot to begin with. I will ask the audience if anybody has an industry certification, such as a PMP or PRINCE2. Several hands usually go up. I then point to the phases of the above diagram and ask them if – when they were studying hard to obtain their certification – they actually followed the discrete phases that everybody else is supposed to follow. No single person has ever suggested that they did; instead all acknowledge that their process of learning looked more like the green line. I then point out that I have always found it perversely funny that people follow the green line to learn a process that tries to insist everybody else must follow the red line. Ironic huh?

Knowing vs. learning problems

It should be stated at this point that you can use the red-line approach for certain types of problems, so I am not outright dismissing it. In fact, within SharePoint projects there are indeed elements that can work quite well this way. The green line, on the other hand, represents a phenomenon that Conklin called “opportunity-driven problem solving” and is the de facto way we work on problems that are new or novel. For example, if you have ever experienced an “aha” moment, it was probably a leap of cognition that followed the jagged green line up or down, where you suddenly saw the problem in a different light. (Shhhh… don’t let your project manager know because you will have to fill in a form of some description!)

In these types of problems, we do not start by gathering and analysing data about the problem, because the problem itself is a moving target and varies depending on different stakeholders and their world views. Thus, there is no pure and concrete understanding of “the problem” because it is still forming as you think about solutions. In short, the jagged green line is a picture of learning. Quoting again from Conklin:

The more novel the problem, the more the problem solving process involves learning about the problem domain. In this sense the waterfall is a picture of already knowing – you already know about the problem and its domain, you know about the right process and tools to solve it, and you know what a solution will look like. As much as we might wish it were otherwise, most projects in the knowledge economy operate much more in the realm of learning than already knowing. You still have experts, but it’s no longer possible for them to guide the project down the linear waterfall process. In the current business environment, problem solving and learning are tightly intertwined, and the flow of this learning process is opportunity-driven

I believe that many innovations stem from the opportunity-driven, creative leap of faith of the green line. On that note, I’d say that one of the greatest opportunity-driven learners had to be Einstein. An article by Mort Orman suggests that Einstein was intrigued by “holes” in prevailing theories and enjoyed posing “mind riddles” to himself, just to see if present theories could satisfactorily explain them. Unlike many others who might have given up when they got stuck, Einstein was persistent and kept at it for 10 years before he came up with that little formula everyone knows. After explaining this story at conferences, I sometimes ask people “Do you think Einstein used waterfall when he came up with relativity?” No one has said yes yet…


Implications…

The pattern of behaviour between the red and green lines represents the difference between a knowing problem and a learning problem. With a knowing problem, the problem is clear to all participants and, even though it might require specialist expertise to solve, many of the variables are well understood. But for a problem that is novel or requires learning from participants, Conklin’s case study illustrates that:

  1. People will examine potential solutions as a way of understanding the problem.
  2. Each examination of a potential solution will change their understanding of the problem.

Given that SharePoint almost always starts out as a learning problem for the majority of participants, I do not see the point in trying to fight that green line. Instead, it is critical that you work with it rather than against it. The difficulty people have with this is that doing so conflicts with that promise of certainty, order and control that the red line appears to offer. Why? Well among other things, you have to:

  1. Expect fluid requirements and scope changes
  2. Involve stakeholders from the start (they have to live with the result and their up-take is presumably a key KPI)
  3. Expect resistance and pullback from stakeholders (because people weigh what they perceive they will lose more heavily than what they might gain)

Above all, you have to avoid penalising people for their learning. If you put barriers in front of people who are trying to improve their understanding of a multi-faceted problem, they will eventually disengage from you. If you want to guarantee this sort of disengagement, go right on ahead and solve some problem before your stakeholders even realise there is a problem. Then when they do realise, smack them with the metaphorical baseball bat known as the scope variation form. One or two of those babies and those annoying stakeholders are guaranteed to go away. A pity your solution will go away with them but hey, it was in scope right?

Confessions…

I deal with this core issue of not penalising learning in a number of ways… some of which I have outlined in blog posts and many that I will cover in detail as we progress in this series and examine more f-laws. If you simply can’t wait for me and want “the answer” then I have news for you. If it were that easy there would be a “best practice” for it and someone would have created a certification by now!

So instead I will give you a couple of KPIs to work to.

  • KPI 1: You get to a stage where your clients’ questions are “informed”. As a SharePoint professional, it is pretty easy to tell where your stakeholders are at in their understanding of SharePoint. Over time there is a certain level of maturity in the questions asked of you and the way they are asked. This reflects both stakeholder learning and your ability to teach. If you get to this stage, you have increased your chances of SharePoint success significantly, which leads onto the next KPI…
  • KPI 2: You get to the stage where your clients are asking you well-informed questions that you don’t know the answer to. Trust me, they will not mind that you don’t, because their awareness of the product will no longer be naively simplistic anyway. You will also have developed a great collaborative partnership by then too. Also don’t forget my quote from Horst Rittel in the midwives post: there is a symmetry of ignorance with complex problems. The knowledge required to solve a complex problem never resides with a single person.

This leads me onto the final KPI:

  • KPI 3: Your clients should start telling you stuff about SharePoint that they have done that you have never done before or didn’t know you could do. In short, they will start teaching you stuff.

If you can create those conditions, be happy that they don’t need you that much anymore.

 

 

Thanks for reading

 

Paul Culmsee

www.hereticsguidebooks.com

www.sevensigma.com.au


Confessions of a (post) SharePoint Architect: Don’t define “governance”

This entry is part 2 of 10 in the series confessions

Hi all and welcome to the second post of a series that I have been wanting to write for a while. In this series, I am going to cover some of the lesser considered areas of being a SharePoint architect and, by association, key aspects of SharePoint governance. In the first confessional post I alluded to the fact that a good SharePoint architect also needs to architect the right conditions for SharePoint success. As I work through this series of articles, I will elaborate further on what those conditions are and how to go about creating them.

To do this, I am drawing from my non-IT work as a Dialogue Mapper and facilitator and, where applicable, will cover these case studies to see if they give us any insights for SharePoint. I also hope to dispel some common myths and misconceptions about SharePoint project delivery in organisations. Some of these might challenge notions you hold dear. But for the most part, I hope that many of you reading this find this material to be instinctively compatible with what you have already come to believe. If you are in the latter group and feel as if you are an organisational agitator, this just might give you the rigour and ammo you need to get through to the powers that be. Better still, tell them to read this series and let them decide for themselves.

Backstory: Ackoff and f-Laws


For what it’s worth, a fair chunk of this material comes from my book, as well as the first module of my SharePoint Governance and Information Architecture class that I run a few times a year in various places around the world. When I designed that class, I was inspired by Russell Ackoff, who co-wrote a funny and highly readable book called “Management f-LAWS: How organisations really work”. F-Laws were defined as:

“truths about organisations that we might wish to deny or ignore – simple and more reliable guides to everyday behaviour than the complex truths proposed by scientists, economists, sociologists, politicians and philosophers”

In case you hadn’t noticed, if you remove the hyphen, each f-law becomes a flaw. You could also consider them as #fail laws. Years ago, I laughed and, at the same time, inwardly cringed when I read each f-law that Ackoff and his co-authors had come up with. I came to realise that SharePoint problems are simply a microcosm of broader issues that plague organisations. If you read Ackoff’s book (and I highly recommend it), you will soon realise that the word “Management” could easily be substituted with “SharePoint”, and it doesn’t take much to come up with a few of your own f-laws. This is exactly what I did and at last count, I have 17 of them. In this post, I will detail the very first one.

F-Law 1: The more comprehensive the definition of governance is, the less it will be understood by all

The first condition that I need to design as a SharePoint architect is to put to bed the many misconceptions about SharePoint governance. In this f-law, I state that the more you try and define what SharePoint governance is, the less anybody will actually understand it. If you consider this counter-intuitive, then let me take it even further. For any project that has a change management aspect (as SharePoint projects often do), definitionising not only doesn’t work, it is actually quite dangerous to your project’s health.

To explain why I have come to this conclusion, I’d like to tell you a little story from my non-IT work. Several years ago, I was working in a sensemaking capacity with an organisation to help them come up with a strategic plan and performance framework for a new city. This was not a trivial undertaking. The aim was to create a framework with an aligned set of KPIs to realise the vision for what the city needed to be in the year 2030. While the vision for the city had been previously agreed and understood, the path to realise that vision had not been.

Now if you have ever been involved in strategic plan development, and think that working out your corporate strategy is difficult, I have news for you. Aligning an organisation to a 3 year plan is one thing. Working with a diverse group to determine performance measures for a future city 25 years away is a different thing altogether. I never realised at the time just how unique and (dare I say) “cutting edge” this work was. Participants were highly varied in skills and areas of interest, and to say each had their own world-view was an… understatement to say the least.

In my book I describe this case study in detail, but for the sake of post size, let’s just say that the opportunity to do this work arose from a failed first attempt to create the framework. The first time around, an Excel spreadsheet like the example below was projected onto the wall. Attempts were made in vain to fill in the strategic outcomes, strategic objectives, key result areas, key performance indicators and measures. After a frustrating few hours of trying this approach, we gave up because participants spent all of their time arguing over the labels and got bogged down in a tangle of definitions and ambiguous terminology. Was it a KPI (Key Performance Indicator) or a KRA (Key Result Area)? Was it a Guiding Principle or a Strategic Objective? Was it a KRA or a Critical Success Factor? Attempts to resolve this issue with definitions got nowhere because even the definitions could not be agreed upon.

[Figure: an example of the strategic planning spreadsheet that was projected onto the wall]

In the end, we solved this issue via a rather novel use of Dialogue Mapping along with a problem structuring approach outlined in a book called Breakthrough Thinking. If you’d like to know more on how it was done, then take a look in chapter 12 of the Heretics Guide.

The criticality of context…

The core problem boiled down to context – or lack of it. What I learnt from this is that in situations without a shared context (and with the wrong tools to deal with it), we fall back to using definitions to try and fill the gap. When faced with a blank spreadsheet and just some labels, participants’ attention was fixated on the definitions of the labels, rather than the empty cells where the focus needed to be. This resulted in a bunch of long-winded discussions about what terms meant, which seriously stymied efforts to make progress.

I have since run many workshops, both SharePoint and non-SharePoint, and the pattern is clear. In fact, I contend that if you proceed down the road of trying to build context via definitions for complex problems, three things will happen.

  1. The definition becomes more verbose. There are a couple of reasons for this:
    • The definition is expanded to incorporate new aspects of the topic space. In an organisational setting, this creates confusion because the definitions of multiple disciplines can often seemingly contradict each other and thus careful “wordsmithing” is required to navigate a path through it.
    • New qualifications or exceptional situations have to be excluded. This leads to more new terms being used in the definition.
  2. As a result of #1, a broader, fundamental definition is developed. This broader definition encompasses more and so is prone to motherly sounding platitudes. Further, such definitions also run the risk of being interpreted in ways other than the one intended by those who worked so hard on the definition.
  3. As a result of #1 and #2, a new word is used, or an existing word is used in a new context, to try and convey the new meanings or concepts proposed. I have heard governance described as “stewardship”, “risk management” and (guilty as charged) “assurance”.

The effect of this can be far-reaching in a bad way, because definitionising has a habit of blinding people to what really matters. This leads to terrible project decisions being made up front that have serious consequences. To understand why, consider the image below:

[Figure: SharePoint governance as the means of getting from the present state to an aspirational future state]

This image represents how governance of a SharePoint project should be viewed. A SharePoint initiative takes time and effort, which costs money. We presumably have recognised that the present state is lacking in some way and want to get to somewhere better – an aspirational future place (look closely at the image above and note the happy and sad smilies). Accordingly, we accept the cost of deploying SharePoint because we believe doing so will make a positive difference. If this were not the case, you would be wasting your time and resources on a pointless initiative. Therefore, it is the difference made by the initiative that will tell you if you have succeeded or not. As a result, we have to have a shared context on what that aspirational future looks like!

Don’t confuse the means with the ends…

Governance is, therefore, the means by which you will achieve the end of getting to some better place. It is informed by the end in mind, and this is why I drew it in the star in the middle of the above diagram. For example, if the end in mind was compliance, then I will govern SharePoint a heck of a lot differently than if, say, the end in mind was improving collaborative decision making.

But consider the diagram below. In this context, it should be clear why working from a definition of governance is often problematic. It implies that:

  1. Governance is not being informed by the end in mind;
  2. Your team do not have a shared understanding of what the end in mind actually looks like.

When this happens, project teams rarely realise it and respond by substituting the end with the means. We overly focus on governance via definition without any clarity or context as to what the aspirational future state actually is. Like the blank spreadsheet example I started with, reality starts to look more like the diagram below (note the happy smilie is gone now):

[Figure: the same diagram with governance no longer informed by the end in mind – the happy smilie is gone]

Steering…

So how do we steer out of this definition pickle? Interestingly, “steer” is an appropriate choice of word if we look at the origin of the word “govern”. “Govern” comes from a Latin nautical term that actually means “to steer”. So if your SharePoint project has been more like the Titanic, and hit a giant iceberg along the way, then clearly you need to focus your governance efforts on looking at what is in front of you, rather than scrubbing the deck or keeping the engine room well oiled. The latter tasks are important of course, but you can do all that, still hit an iceberg and waste a lot of money.

To steer, we all have to understand what the destination is, or at the very least, all agree on the direction. To help you with that journey, consider my final diagram. To steer SharePoint the right way for your organisation requires you to answer four key questions:

  1. What is the aspirational future state and what does it look like?
  2. Why is this the aspirational future state we want?
  3. Who will do what to get us to that state?
  4. How will we get to that state?

[Figure: the four steering questions – what, why, who and how]

The fundamental problem with most SharePoint projects is that questions 1 and 2 are not answered sufficiently, if at all. The next few posts will explore why this is the case, but in the meantime, remember that we could do a SharePoint project that is to scope, time and cost, yet still have no user up-take if we are solving the wrong problem in the first place. Therefore remember that:

  1. Governance is a means to an end, and not the end in itself.
  2. We shouldn’t undertake a “SharePoint governance” project, or consider “SharePoint governance” as a deliverable on a project plan. The act of developing a shared context of what the problems are, and using that to steer governance decision making, is paramount. Fail to do this and your best plans will not save you.

Conclusions and coming next…

This is the second post of what will be a large series – possibly the largest series I have written so far. In the next post, I will continue our journey through SharePoint governance mistakes and, along the way, start to identify what we can do to better answer the “What”, “Why”, “Who” and “How” questions. If you enjoy this series, then consider signing up to one of my classes if one is running in your neck of the woods.

 

Thanks for reading

Paul Culmsee

www.sevensigma.com.au

www.hereticsguidebooks.com


Confessions of a (post) SharePoint Architect: Midwives versus doctors

This entry is part 1 of 10 in the series confessions

Bjørn Furuknap you have gone too far this time! There I said it.

On behalf of the SharePoint community, I feel that someone needs to speak up about your reprehensible behaviour, so I have taken it upon myself to right the wrongs that you so needlessly inflicted onto the community. You see, Bjørn has gone and written a non-controversial post on the confused role of the SharePoint architect. I am extremely concerned about this abrupt change of behaviour and worry about the example he is setting for the young and impressionable members of the SharePoint community. If Bjørn keeps going down this rational road, then he will make the rest of us look irrational and tip the delicate balance of the SharePoint blogging ecosystem into unknown territory. In other words, we will lose the excuse of “Well, at least I’m not as nuts as Furuknap!”

That said, I have been meaning to write about insights from my life as a (post) SharePoint architect anyway. I have a few of my own lessons learnt and Bjørn has inspired me to finally get a few written down. So in this preamble post, and in a forthcoming series on common SharePoint governance mistakes, I will give you a dose of the opinionated world according to Paul, but I will back it up with some juicy references that you can check out for yourself if you are that way inclined.

Why (post) SharePoint Architect?

You might be wondering why I referred to myself as a “post” SharePoint architect. Unfortunately it’s hard to answer this question without sounding self-indulgent, so I will keep it brief.

In 2007, I got my first non-IT gig in a highly complex urban planning project. I had no technical or discipline knowledge to contribute to this project at all. My job was to enable others to develop a shared understanding of a highly complex problem they all faced, in order to enable shared commitment to a course of action. Since that time, this non-IT side of my work has continued to grow in terms of the number of clients and the scale of the problems being tackled. Like any skill, I have gotten better with practice, which in turn has led to larger and more complex scenarios.

This year in particular, I’ve helped the executive teams of several large organisations re-find their purpose, realign their strategy and make some very difficult and courageous decisions in redesigning their organisations. Just to be clear, we are not talking SharePoint and we are not talking IT. I am talking about how these organisations adapt to changing conditions that, in some cases, affect their very existence. These organisations span the public and private sector across Australia.

From a SharePoint perspective, you could say I have moved from the server room to the meeting room and now to the boardroom. In spite of my self-indulgence warning earlier, you have to admit – this is damn cool!

Who’s misunderstood anyway?

So with that little preamble done, let me return to Bjørn’s post. He feels the SharePoint architect role is misunderstood and I agree with this, but in a different way. I feel the core issue is that SharePoint architects themselves are often the ones who misunderstand what they need to do and how they should go about it. This in turn manifests in the rest of the world not understanding what they ought to be doing.

To elaborate on this contention, let’s meet the four most common SharePoint architect stereotypes that I see in organisations:

The SharePoint architect who used to be a developer

This stereotype comes in two flavours: the alpha developer who attained top dog status among peers via bluffed programming prowess, or the developer who always struggled and finds that this is a way to get out of hands-on coding. Either way, this person still lives through the SharePoint object model. They will focus on ensuring that there are unit tests, solid source control and a solution packaging regime. Their object-oriented view means that metadata is king and folders are to be despised. They will not think twice about utilising content types in any situation, because it is completely obvious that you would work this way. Their information architecture will be a work of art, and they are not shy in telling everyone so. To sum up, their SharePoint solutions will be logical, well coded to defined standards and completely useless to users.

The SharePoint architect who used to be an infrastructure guy

This stereotype tends to bewilder clients and colleagues alike with a seemingly endless set of options and considerations that need to be made up front. It is likely this architect will introduce SharePoint via the pie/frisbee diagrams, but discussions will focus on architecting for scalability, security and fault tolerance. This architect will likely mandate strict governance rules on those cowboy developers, untrustworthy site admins, and downright scary users to ensure that the environment remains pristine. Accordingly, SharePoint Designer will be outlawed – it’s so obvious that one shouldn’t even be asking why. Any burden imposed by these governance rules will be seen as a necessary evil and will be addressed by mandatory user training and besides, the next SharePoint version will definitely address the gaps. Their solutions will be scalable, architecturally sound and completely useless to users.

The SharePoint architect who thinks they are an enterprise architect

This stereotype – despite their obvious protests to the contrary – is the right-brained equivalent of the infrastructure guy. This person absolutely gets off on making models, because conceptual reality does not involve making any actual commitments. In fact, as soon as there is any push towards a commitment, they feel an irrepressible urge to resist and push everyone back into the make-believe world. Over-utilising the line “Oh, I am business you see, not technical” as if it’s a sign of maturity, they will plan, plan and plan again, drawing many cool diagrams on whiteboards but never a task on a Gantt chart. The models they come up with are abstract and over-engineered, and they always fall deeply in love with them. They won’t let anybody touch their models for fear of their abstract thing of beauty being messed with. The irony is that the basis for their models will actually be underpinned by some solid theoretical frameworks. Unfortunately, this person doesn’t actually understand them in any depth, but the terms used sound really cool. Their solutions will be… wait, who am I kidding? They won’t have any solutions because they plan forever…

The SharePoint project manager who thinks they are an architect

This stereotype is arguably the most dangerous of the lot because they are driven by the need to “Get Things Done Now!” whether those “Things” make sense or not. Consequently, they jump into solution mode without a full understanding of the real problem the business is facing. Scope documents, plans, schedules and Gantt Charts abound, but the chances are that all these are geared towards solving the wrong problem. Talking to annoying stakeholders just gets in the way of the scope statement and besides, that’s what Business Analysts are put on this earth to do anyway. Their solutions will be built to time and cost, but completely useless to users.

“Let’s drill halfway…”

Bjørn also spoke of architects needing breadth of knowledge over depth of knowledge. This is completely true, but it is not the full story. You see, our stereotype architects share a common bad habit with many other consultants who end up doing damage to organisations.

Irrespective of their breadth of knowledge or otherwise, these architects act like doctors prescribing remedies. They breeze into organisations, making sweeping statements that contain cool sounding maxims like “business value.” Then, using their clearly superior intellect based on years of experience and that cherished breadth of knowledge, they assess the organisational symptoms and prescribe the appropriate SharePoint medicine to address them.

I can hear it now… “Got an organisational headache? Just take this SharePoint content-type three times a day and see me if pain persists.”

I’m sure people can see the obvious problems with this approach (if you can’t then you are in the wrong business – seriously). One of the many issues is that organisational symptoms are often just visible manifestations of deeper underlying issues. The late, great Russell Ackoff once stated that you would not use brain surgery to cure a headache, despite the pain being felt in your head. Instead you would take a pill, even though there appears to be no direct relationship between the pill and the pain being experienced. Ackoff mused that organisations routinely use brain surgery for their headaches, and tools like SharePoint are the blunt instrument of choice to do the drilling. Add to this the technical complexity of SharePoint, which means the brain surgery has to happen in discrete phases.

“Okay guys we don’t have enough budget for this, so let’s drill halfway into the skull for phase 1.”

The SharePoint midwife…

SharePoint architects have to understand that the solutions they architect are actually not for them. “Gee Paul that’s profound,” I hear you say sarcastically. While this statement might sound obvious, why is it that many architects exhibit behaviours that contradict it?

If you want to know why this happens I suggest that you read part 1 of the Heretics book. But rather than rehash that here, let’s see what we can learn about problem solving from the insights of Horst Rittel and Ron Heifetz. In case you are wondering, no they are not SharePoint MVPs. Rittel coined the term “wicked problem” and is highly influential in various fields due to his early insights into complex problem solving. Heifetz is well known for his work on the theory and practice of adaptive leadership: how to mobilise people through what he termed adaptive change.

Note: If you have not heard of the term “wicked problem”, then go and read this old post of mine. It’s assumed knowledge here…

Rittel stated that when solving problems, nobody wants to be “planned at.” Additionally, the knowledge required to solve a complex (wicked) problem never resides with a single person. Instead, there is a symmetry of ignorance (I love that term). Rittel characterised symmetry of ignorance as situations “where both expertise and ignorance is distributed over all participants and no-one ‘knows better’ by virtue of degrees or status.” Accordingly, the process of problem solving must involve those who are directly affected by the problem. These are the key stakeholders “living” the problem, rather than experts who “know” the problem theoretically. The aforementioned experts should guide the process of dealing with a wicked problem but not impose solutions. In Rittel’s words, the planner is the “midwife of problems rather than the offerer of therapies.” It is the group that must come up with the answers.

Ron Heifetz echoed Rittel in his advice to leaders. One key strategy of adaptive leadership is to give the work back. Heifetz warned that when a leader undertakes to solve a problem, the leader becomes the problem in the eyes of many stakeholders. The implication is that the leader also becomes the convenient scapegoat if the solution goes awry, as blame can be attributed to the leader. Instead, by placing work where it belongs – with the employees responsible for doing the work – Heifetz argued that issues will be internalised and owned by the parties best placed to deal with them. The best solutions, he maintained, are when the people with the problem become the people with the solution.

Confessions…

Given what Rittel and Heifetz have to say, it should be little wonder that I feel SharePoint architects should not be doctors prescribing remedies. SharePoint is often an adaptive change because you are asking people to change their behaviours. Those architects (and management consultants) who act like doctors tend to find out fairly quickly that the solutions they so lovingly come up with do not always get traction. Therefore, as Rittel suggests, a SharePoint architect needs to be more of a midwife than a doctor. It’s the client who is giving birth to this thing and you are there to create the conditions that make the journey as stress-free as possible. Who is the one who has to adapt and live with the result anyway? Certainly not the consultant.

For some, this comes at a cost to the architect’s ego, because architects often have to let go of their creations. An architect cannot revel in the glory of their masterpiece if those affected by it do not buy into it. It will have a crappy legacy no matter what the intent. In letting go, one has to accept that stakeholders will also have an incomplete world view and will make mistakes. Therefore as an architect, how about architecting not just the SharePoint platform, but the conditions by which SharePoint is delivered.

An obvious condition is one of real collaboration among stakeholders (which when you think about it, is kind of important when putting in collaborative systems!) Another condition that should be there is one that allows people to fail forward. Assume that mistakes will be made, take away the blame and architect SharePoint to be resilient in the face of change instead of making it brittle. Create the environment conducive to co-creation by painting part of the picture and allow participants to fill it in. After all, the learning that occurs via the journey is often just as important as the result achieved.

My confession is that I often say to my SharePoint clients that it is inherently more efficient for me to transfer my knowledge of SharePoint to them, than for them to transfer their deep knowledge of their organisation to me. To do the latter would be highly inefficient for both my clients and me, and my clients would not have the same opportunity to build their own SharePoint competencies and adaptive capacity. At the end of the day, they architect a lot of the solutions. Sure… I might offer suggestions here and there, and I might nudge them when I feel they need to be nudged, but more often than not, I lay some core foundations and they are the ones who do a lot of the legwork.

So to conclude, while I agree with everything Bjørn said in his post, I think the real key to being a good SharePoint architect is to architect the conditions by which SharePoint is delivered, just as much as SharePoint itself. While being a midwife may not be as glamorous as being a doctor, the solutions delivered will have more staying power.

 

Thanks for reading

Paul Culmsee


Demystifying SharePoint Performance Management Part 11 – Tales from the Microsoft labs

This entry is part 11 of 11 in the series Perf

Hi all and welcome to the final article in my series on SharePoint performance management – for now anyway. Once SharePoint 2013 goes RTM, I might revisit this topic if it makes sense to, but some other blogging topics have caught my attention.

To recap the entire journey, the scene was set in part 1 with the distinction between lead and lag indicators. In part 2, we examined Requests per Second (RPS) and looked at its strengths and weaknesses as a performance metric. From there, we spent part 3 looking at how to leverage RPS via the Log Parser utility and a little PowerShell goodness. Part 4 rounded off our examination of RPS by delving deeper into utilising Log Parser to eke out interesting RPS-related performance metrics. We also covered the very excellent SharePoint Flavored Weblog Reader utility, which saves a bunch of work and can give some terrific insights. Part 5 switched tack into the wonderful world of latency and, in particular, focused on disk latency. Part 6 then introduced the disk performance metrics of IOPS and MBPS and their relationship to latency. We also looked at typical SharePoint and SQL Server disk IO characteristics and then examined the pros and cons of RPS, IOPS, latency and MBPS and how they all relate to each other. In part 7, continuing into part 8, we introduced the performance monitor counters that give us insight into these metrics, as well as the SQLIO utility for stress testing disk infrastructure. This set the scene for part 9, where we took a critical look at Microsoft’s own real-world findings to help us understand what suitable figures would be. The last post then introduced a couple of other excellent tools, namely Process Monitor and the Windows Performance Analysis Toolkit, that should be in your arsenal for SharePoint performance.
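As a quick refresher of the Log Parser technique from parts 3 and 4, here is a minimal sketch of the sort of script I mean. The log path is hypothetical and you would point it at your own IIS logs; it assumes Log Parser 2.2 is installed and the logs use the default W3C format.

    # A minimal sketch: requests per minute from IIS logs, then peak RPS.
    # Assumes Log Parser 2.2 (logparser.exe) is on the path.
    $logs = 'D:\Logs\W3SVC1\u_ex*.log'   # hypothetical path - use your own logs
    $query = "SELECT QUANTIZE(TO_TIMESTAMP(date, time), 60) AS Minute, " +
             "COUNT(*) AS Requests INTO rpm.csv FROM $logs " +
             "GROUP BY Minute ORDER BY Minute"
    & logparser.exe -i:W3C -o:CSV $query
    # Find the busiest minute and express it as requests per second
    $peak = Import-Csv rpm.csv | Sort-Object { [int]$_.Requests } -Descending | Select-Object -First 1
    "Peak minute {0}: {1} requests (~{2:N1} RPS)" -f $peak.Minute, $peak.Requests, ([int]$peak.Requests / 60)

Run over weeks or months of logs, this sort of query is also how you can tease out the peak usage and concurrency patterns discussed throughout the series.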

In this final article, we will tie up a few loose ends.

Insights from Microsoft labs testing…

In part 9 of this series, I examined Microsoft’s performance figures reported from their production SharePoint 2010 deployments. This information comes from the oft-mentioned SharePoint 2010 Capacity Planning Guide. Microsoft are a large company and they run four different SharePoint farms for different collaborative scenarios. To recap, those scenarios were:

  1. Enterprise Intranet environment (also described as published intranet). In this scenario, employees view content like news, technical articles, employee profiles, documentation, and training resources. It is also the place where search queries are performed for all of the other SharePoint environments within the company. Performance reported for this environment was 33580 unique users per day, with an average of 172 concurrent users and a peak concurrency of 376 users.
  2. Enterprise intranet collaboration environment (also described as intranet collaboration). This is where important team sites and publishing portals are housed. Sites created in this environment are used as communication portals, applications for business solutions, and general collaboration. Performance reported for this environment was double the first environment with 69702 unique users per day. Concurrency was more than double, with an average of 420 concurrent users and a peak concurrency of 1433 users.
  3. Departmental collaboration environment. In this scenario, employees use this environment to track projects, collaborate on documents, and share information within their department. Performance reported for this environment was a much lower figure of 9186 unique users per day (which makes sense given it is departmental stuff). Nevertheless, concurrency was similar to the enterprise intranet scenario with an average of 189 concurrent users and a peak concurrency of 322 users.
  4. Social collaboration environment. This is Microsoft’s My Sites scenario, connecting employees with one another and presenting personal information such as areas of expertise, past projects, and colleagues to the wider organisation. This included personal sites and documents for collaboration. Performance reported for this environment was 69814 unique users per day, with an average of 639 concurrent users and a peak concurrency of 1186 users.

Presented as a table, we have the following rankings:

Scenario                             Unique users   Avg concurrent   Peak concurrent
Social Collaboration                 69814          639              1186
Enterprise Intranet Collaboration    69702          420              1433
Enterprise Intranet                  33580          172              376
Departmental Collaboration           9186           189              322

When you think about it, the performance information reported for these scenarios are lag indicator based. That is, they are real-world performance statistics based on a pre-existing deployment. Thus while we can utilise the above figures for some insights into estimating the performance needs of our own SharePoint environments, they lack important detail. For example: in each scenario above, while the SharePoint farm topology was specified, we have no visibility into how these environments were scaled out to meet performance needs.

Some lead indicator perspective…

Luckily for us, Microsoft did more than simply report on the performance of the above four collaboration scenarios. For two of the scenarios Microsoft created test labs and published performance results with different SharePoint farm topologies. This is really handy indeed, because it paints a much better lead indicator scenario. We get to see what bottlenecks occurred as the load on the farm was increased. We also get insight about what Microsoft did to alleviate the bottlenecks and what sort of a difference it made.

The first lab testing was based off Microsoft’s own Departmental collaboration environment (the 3rd scenario above) and is covered on pages 144-162 of the capacity planning guide. The other lab was based off the Enterprise Intranet Collaboration Environment (the 2nd scenario above) and is the focus of attention on pages 174-197. Consult the guide for full detail of the tests. This is just a quick synthesis.

Lab 1: Enterprise Intranet Collaboration Environment

In this lab, Microsoft took a subset of the data from their production environment and used different hardware. They acknowledge that the test results will be affected by this, but in my view it is not a show stopper if you take a lead indicator viewpoint. Microsoft tested web server scale-out by starting with a three-server topology (one web front end server, one application server and one database server). They then increased the load on the farm until they reached a saturation point. Once this happened, they added an additional web server to see what would happen. This was repeated, scaling from one web server (1*1*1) to five web servers (5*1*1).

The transactional mix used for this testing was based on the breakdown of transactions from the live system. Little indication of read vs. write transactions is given in the case study, but on page 152 there is a detailed breakdown of SharePoint traffic by type. While I won’t detail everything here, regular old browser traffic was the most common, representing 36% of all test traffic. WebDAV came in second (WebDAV typically includes Office clients and Windows Explorer view), representing 28.12% of traffic, and Outlook sync traffic third at 7.04%.

Below is a table showing the figures where things bottlenecked. Microsoft produce many graphs in their documentation so the figures below are an approximation based on my reading of them. It is also important to note that Microsoft did not perform tests while search was running, and compensated for search overhead by defining a max CPU limit for SQL Server of 80%.

Topology          1*1*1   2*1*1   3*1*1   4*1*1   5*1*1
Max RPS           180     330     510     560     565
Sustainable RPS   115     210     305     390     380
Latency           .3      .2      .35     .2      .2
IOPS              460     710     910     920     840
WFE CPU           96%     89%     89%     76%     58%
SQL CPU           17%     33%     65%     78%     79%

For what it’s worth, the sustainable RPS figure is based on the servers not being stressed (all servers having less than 50% CPU). Looking at the above figures, several things are apparent.

  1. The environment scaled up to four Web servers before the bottleneck changed to be CPU usage on the database server
  2. Once database server CPU hit its limits, RPS on the web servers suffered. Note that RPS from 4*1*1 to 5*1*1 is negligible when SQL CPU was saturated.
  3. The addition of the fourth web server had the least impact on scalability compared to the preceding three (RPS only increased from 510 to 560, which is much less than the gains from adding the previous web servers). This suggests the SQL bottleneck hit somewhere between three and four web servers.
  4. The average latency was almost constant throughout the whole test, unaffected by the number of Web servers and throughput. This suggests that we never hit any disk IO bottlenecks.

Once Microsoft identified the point at which database server CPU was the bottleneck (4*1*1), they added an additional database server and then kept adding web servers as they did previously. They split half the content databases onto one SQL server and half onto the other. It is important to note that the underlying disk infrastructure was unchanged, meaning that total disk performance capability was kept constant even though there were now two database servers. This allowed Microsoft to isolate server capability from disk capability. Here is what happened:

Topology   4*1*1   4*1*2   6*1*2   8*1*2
RPS        560     660     890     930
Latency    .2      .35     .2      .2
IOPS       910     1100    1350    1330
WFE CPU    76%     87%     78%     61%
SQL CPU    78%     33%     52%     58%

Here is what we can glean from these figures.

  1. Adding a second database server did not provide much additional RPS (560 to 660). This is because CPU utilisation on the web servers was high. In effect, the bottleneck shifted back to the web front end servers.
  2. With two database servers and eight web servers (8*1*2), the bottleneck became the disk infrastructure. (Note that IOPS at 8*1*2 is no better than at 6*1*2.)

So what can we conclude? From the figures shown above, it appears that you could reasonably expect (remember we are talking lead indicators here) that bottlenecks are likely to occur in the following order:

  1. Web Server CPU
  2. Database Server CPU
  3. Disk IOPS

It would be a stretch to suggest when each of these would happen because there are far too many variables to consider. But let’s now examine the second lab case study to see if this pattern is consistent.

Lab 2: Divisional Portal Environment

In this lab, Microsoft took a different approach from the lab we just examined. This time they did not concern themselves with IOPS (“we did not consider disk I/O as a limiting factor. It is assumed that an infinite number of spindles are available”). The aim this time was to determine at what point a SQL Server CPU bottleneck was encountered. Based on what I noted from the first lab test above, unless your disk infrastructure is particularly crap, SQL Server CPU should become a bottleneck before IOPS. However, one thing in common with the last lab test was that Microsoft factored in the effects of an ongoing search crawl by assuming 80% SQL Server CPU as the bottleneck indicator.

Much more detail was provided on the transaction breakdown for this lab. Pages 181 and 182 list transactions by type and, unlike the first lab, whether they are read or write. While it is hard to directly compare to lab 1, it appears that more traffic is oriented around document collaboration than in the first lab.

The basic methodology was to start off with a minimal farm configuration of a combined web/application server and one database server. Through multiple iterations, the test ended with a configuration of three web servers, one application server and one database server. The table of results is below:

Topology          1*1    1*1*1   2*1*1   3*1*1
RPS               101    150     318     310
Sustainable RPS   75     99      191     242
Latency           .81    .85     .6      .8
Users simulated   125    150     200     226
WFE CPU           86%    36%     76%     42%
APP CPU           NA     41%     46%     44%
SQL CPU           18%    32%     56%     75%

Here is what we can glean from these figures.

  1. Web Server CPU was the first bottleneck encountered.
  2. At a 3*1*1 configuration, SQL Server CPU became the bottleneck. In lab 1 it was somewhere between the third and fourth web server.
  3. RPS, when CPU is taken into account, is fairly similar between each lab. For example, in the first lab, the 2*1*1 scenario RPS was 330. In this lab it was 318, and both had comparable CPU usage. The minimal topology tests had differing results (101 vs 180), but if you adjust for the reported CPU usage, things even up.
  4. With each additional web server, the increase in RPS was almost linear. We can extrapolate that as long as SQL Server is not bottlenecked, you can add more web servers and additional increases in RPS are possible.
  5. Latencies were not affected much as we approached the bottleneck on SQL Server. Once again, the disk subsystem was never stressed.
  6. The previous assertion that bottlenecks are likely to occur in the order of web server CPU, database server CPU and then the disk subsystem appears to hold true.

Before we go any further, one important point that I have neglected to mention so far is that the figures above are extremely undesirable. Do you really want your web server and database server to be at 85% CPU constantly? I think not. What you are seeing above are the upper limits, based on Microsoft’s testing. While this helps us understand maximum theoretical capacity, it does not make for a particularly scalable environment.

To account for the issue of reporting on max load, Microsoft defined what they termed a “green zone” of performance. This is a term to describe what “normal” load conditions look like (for example, less than 50% CPU), and they also provided RPS results for when the servers were in that zone. If you look closely at the above tables you will see those figures: I labelled them as “Sustainable RPS”.

In case you are wondering, the difference between sustainable RPS and peak RPS for each of the scenarios ranges between 60% and 75% of the peak RPS reported (for example, in lab 1’s 3*1*1 test, the sustainable figure of 305 is around 60% of the 510 peak).

Some Microsoft math…

In the second lab, Microsoft offers some advice on how to translate their results into our own deployments. They suggest determining a users-to-RPS ratio and then utilising the green zone RPS figures to estimate server requirements. This is best illustrated by their own example from lab 2. They state the following:

  • A divisional portal in Microsoft, which supports around 8000 employees collaborating heavily, experiences an average RPS of 110.
  • That gives a users-to-RPS ratio of ~72 (that is, 8000/110). In other words, 72 users amount to 1 RPS.
  • Using this ratio and the sustainable RPS figures from the lab 2 results, Microsoft created the following table (page 196) to suggest the number of users a typical deployment might support. I have sketched the arithmetic in code just below the table.

[Table: Microsoft’s suggested user numbers per farm topology, from page 196 of the capacity planning guide]
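For the spreadsheet-averse, here is a minimal sketch of that arithmetic. The 8000 users and 110 RPS are Microsoft’s divisional portal figures quoted above, and the sustainable RPS values come from the lab 2 table earlier in this post; treat the output as a rough estimate, not a capacity plan.

    # A sketch of Microsoft's users-to-RPS arithmetic (capacity guide, page 196).
    $totalUsers = 8000   # employees on Microsoft's divisional portal
    $averageRps = 110    # average RPS observed on that portal
    $usersPerRps = [math]::Round($totalUsers / $averageRps)   # ~72 users per 1 RPS

    # Sustainable ("green zone") RPS figures from the lab 2 table above
    $sustainableRps = [ordered]@{ '1*1' = 75; '1*1*1' = 99; '2*1*1' = 191; '3*1*1' = 242 }
    foreach ($topology in $sustainableRps.Keys) {
        '{0,-6} topology supports roughly {1:N0} users' -f $topology, ($sustainableRps[$topology] * $usersPerRps)
    }

If your own users-to-RPS ratio (worked out from your own logs) is materially different from 72, substitute it in and the estimates move accordingly.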

A basic performance planning methodology…

Okay… so I am done… I have no more topics that I want to cover (although I could go on forever on this stuff). Hopefully I have laid out enough conceptual scaffolding to allow you to read Microsoft’s large and complex SharePoint performance and capacity planning documentation with more clarity than before. If I were to sum up a few of the key points of this 11-part exploration into the weird and wonderful world of SharePoint performance management, it would be as follows:

  1. Part 1: Think of performance in terms of lead and lag indicators. You will have less of a brain fart when trying to make sense of everything.
  2. Part 2: Requests are often confused with transactions. A transaction (eg “save this document”) usually consists of multiple requests and the number of requests is not an indicator of performance. Look to RPS to help here…
  3. Part 3 and 4: The key to utilising RPS is to understand that as a counter on its own, it is not overly helpful. BUT it is the one metric that you probably have available in lots of detail, due to it being captured in web server logs over time. Use it to understand usage patterns of your sites and portals and determine peak usage and concurrent usage.
  4. Part 5: Latency (and disk latency in particular) is both unavoidable, yet one of the most common root causes of performance issues. Understanding it is critical.
  5. Part 6: Disk latency affects – and is affected by – IOPS, IO size and IO patterns. Focusing one one without the others is quite pointless. They all affect each other so watch out when they are specified in isolation (ie 5000 IOPS).
  6. Part 6, 7 and 8:  Latency and IOPS are handy in the  sense that they can be easily simulated and are therefore useful lead indicators. Test all SQL IO scenarios at 8k and 64K IO size and ensure it meets latency requirements.
  7. Part 9: Give your SAN dudes a specified IOPS, IO Size and latency target. Let them figure out the disk configuration that is needed to accommodate. If they can make your target then focus on other bottleneck areas.
  8. Part 10: Process Monitor and Windows Performance Analyser are brilliant tools for understanding disk IO patterns (among other things)
  9. Part 9 and 11: Don’t believe everything you read. Utilise Microsoft’s real world and lab results as a guide but validate expected behaviour by testing your own environment and look for gaps between what is expected and what you get.
  10. Part 11: In general, Web Server CPU will bottleneck first, followed by SQL Server CPU. If you followed the advice of points 6 and 7 above, then disk shouldn’t be a problem.
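To put point 6 into practice, here is the sort of SQLIO run you might hand to your storage team. Treat it as an indicative sketch only: the test file path, duration, thread count and outstanding IO depth are all assumptions to tune for your environment, and the test file should be created beforehand, large enough to swamp any controller cache.

    # 8KB random IO (reads, then writes) for 5 minutes, capturing latency stats (-LS)
    sqlio -kR -s300 -frandom -b8 -o8 -t4 -LS D:\DATA\testfile.dat
    sqlio -kW -s300 -frandom -b8 -o8 -t4 -LS D:\DATA\testfile.dat

    # 64KB sequential IO, typical of read-ahead and backup patterns
    sqlio -kR -s300 -fsequential -b64 -o8 -t4 -LS D:\DATA\testfile.dat
    sqlio -kW -s300 -fsequential -b64 -o8 -t4 -LS D:\DATA\testfile.dat

The figures that matter in the output are the IOs/sec and the latency histogram: if the averages blow out past your targets at these IO sizes, you have found your bottleneck before SharePoint is even installed.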

Now I promised at the very start of this series that I would provide you with a lightweight methodology for estimating SharePoint performance requirements. So assuming you have read this entire series and understand the implications, here goes nothing…

  • 1. Give your storage team a deliberately ambitious starting target: 5000 IOPS, at both 8KB and 64KB IO sizes (as per points 6 and 7 in the summary above).
  • 2. Have them validate that target with a tool like SQLIO, checking the results against the 8ms and 1ms latency targets discussed earlier in the series.
  • 3. If they can meet this target, skip to step 8.  🙂

If they cannot meet this, don’t worry, because there are two benefits already gained. First, in finding that they cannot get near the above figures, they will do some optimisation, like testing different stripe sizes and checking other common disk performance hiccups. This means they now better understand the disk performance patterns and are thinking in terms of lead indicators. The second benefit is that you can avoid tedious, detailed discussions up front about which RAID level to go with.

So while all of this is happening, do some more recon…

  • 4. Examine Microsoft and HP’s testing results that I covered in part 9 and in this article. Pay particular attention to the concurrent users and RPS figures. Also note the IOPS results from Microsoft and HP testing. To remind you, no test ever came in over 1400 IOPS.
  • 5. Use logparser to examine your own logs to understand usage patterns. Not only should you eke out metrics like max concurrent users and RPS figures, but examine peak times of the day, RPS growth rate over time, and what client applications or devices are being used to access your portal or intranet.
  • 6. Compare your peak and concurrent usage stats to Microsoft and HP’s findings. Are you half their size, double their size? This can give you some insight into a lower IOPS target to use. If you have 200 simultaneous users, then you can derive a target IOPS for your storage guys to meet that is more modest and in line with your own organisation’s size and make-up (see the sketch below for the arithmetic).
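The arithmetic behind step 6 might look like the rough sketch below. Every figure here is a hypothetical placeholder: plug in the peak concurrent users from your own logs and the corresponding numbers from whichever Microsoft or HP test case you most resemble.

    # All hypothetical figures - replace with your own logparser findings
    # and the numbers from the MS/HP case study closest to your organisation
    $yourPeakConcurrentUsers = 200
    $testPeakConcurrentUsers = 400
    $testPeakIops            = 1400   # recall: no MS/HP test came in over 1400 IOPS

    $targetIops = [math]::Round($testPeakIops * ($yourPeakConcurrentUsers / $testPeakConcurrentUsers))
    "Revised IOPS target for the storage team: $targetIops"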

By now the storage guys will come back to you in shock because they cannot get near your 5000 IOPS requirement. Be warned though… they might ask you to sign a cheque to upgrade the storage subsystem to meet this requirement. It won’t be coming out of their budget for sure!

  • 7. Tell them to slowly reduce the IOPS until they hit the 8ms and 1ms latency targets and give them the revised target based on the calculation you made in step 6. If they still cannot make this, then sign the damn cheque!

At this point, we have assumed that there is enough assurance that the disk infrastructure is sound. Now it’s all about CPU and memory.

  • 8. Determine a users-to-RPS ratio by dividing your total user base by average RPS, based on your findings from step 5 (see the Log Parser sketch after this list).
  • 9.  Look at Microsoft’s published table (page 196 of the capacity planning guide and reproduced here just above this conclusion). See what it suggests for the minimum topology that should be needed for deployment.
  • 10. Use that as a baseline and now start to consider redundancy, load balancing and all of that other fun stuff.
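For step 8 you need an average RPS figure from your own IIS logs. A Log Parser query along the following lines will get you a requests-per-second breakdown to average out; the log path and file naming are placeholders that depend on your IIS logging configuration.

    # Requests per second across a set of IIS logs (adjust the path and log names)
    $query = "SELECT QUANTIZE(TO_TIMESTAMP(date, time), 1) AS Second, COUNT(*) AS Requests INTO rps.csv FROM u_ex*.log GROUP BY Second"
    logparser -i:IISW3C -o:CSV $query

    # Then derive the ratio from your own numbers (hypothetical figures below)
    $totalUsers = 4000
    $averageRps = 55
    $ratio = [math]::Round($totalUsers / $averageRps)   # ~73 users per RPS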

And there you have it! My 10 step dodgy performance analysis method.  🙂

Conclusion and where to go next…

Right! Although I am done with this topic area, there are some next steps to consider.

Remember that this entire series is predicated on the notion that you are in the planning stage. Let’s say you have come up with a suggested topology, deployed the hardware and developed your SharePoint masterpiece. The job of ensuring performance meets expectations does not stop there. You should still consider load testing to validate the lead indicators and confirm that the deployed topology meets expectations. There is also a seemingly endless number of optimisations that can be done within SharePoint itself, such as caching to reduce SQL Server load, or tuning web application and service application settings.

But for now, I hope that this series has met my stated goal of making this topic area that little bit more accessible, and thank you so much for taking the time to read it all.

 

Paul Culmsee

www.hereticsguidebooks.com

www.sevensigma.com.au




Demystifying SharePoint Performance Management Part 10 – More tools of the trade…

This entry is part 10 of 11 in the series Perf

Hi all and welcome to the tenth article in my series on demystifying SharePoint performance management. I do feel that we are getting toward the home stretch here. If you go way back to Part 1, I stated my intent to highlight some common misconceptions and traps for younger players in the area of SharePoint performance management, while demonstrating a better way to think about measuring SharePoint performance (i.e. lead and lag indicators). While doing so, we examined the common performance indicators of RPS, IOPS, MBPS, latency and the tools and approaches to measuring and using them.

I started the series praising some of Microsoft’s material, namely the “Planning guide for server farms and environments for Microsoft SharePoint Server 2010”, “Capacity Planning for Microsoft SharePoint Server 2010” and “Analysing Microsoft SharePoint Products and Technologies Usage” guides. But they are not perfect by any stretch, and in the last post I covered some of the inconsistencies and questionable information that exists in the capacity planning guide in particular. Not only are some of the quoted disk performance figures given without the critical context needed to measure them meaningfully, but some of the figures themselves are highly questionable.

I therefore concluded Part 9 by advising readers not to believe everything presented and always verify espoused reality with actual reality via testing and measurement.

Along the journey we have undertaken, we have examined some of the tools that are available to perform such testing and measurement. So far, we have used Log Parser, SharePoint Flavored Weblog Reader, Windows Performance Monitor, SQLIO and the odd bit of PowerShell thrown in for good measure. This article will round things out by showing you two additional tools to verify theoretical fiction against hard cold reality. Both of these tools allow you to get a really good sense of IO patterns in particular (although they both have many other purposes). The first will be familiar to my more nerdy readers; the second is highly powerful, but much lesser known to newbies and seasoned IT pros alike.

So without further ado, let’s meet our tools… Process Monitor and the Windows Performance Analysis Toolkit.

Process Monitor

Our first tool is Process Monitor, also commonly known as Procmon. Now this tool is quite well known, so I will not be particularly verbose in my examination of it. But for the three of you who have never heard of it, Process Monitor allows us to (among many other things) monitor accesses to the file system by processes running on a server. This gives us a really low level view of IO requests as they happen in real time. What is really nice about Process Monitor is its granularity: it allows you to set up sophisticated filtering that helps you see the wood for the trees. For example, one can create fairly elaborate filters that show just the details of a specific SQL database. Also handy is that all collected data can be saved to file for later examination.

When you start Process Monitor, you will see a screen something like the one below. It immediately starts collecting data about various operations (there are around 140 monitorable operations covering file system, registry, process, network and kernel activity), although out of the box it monitors the file system, registry and processes. The default columns that are displayed include:

  • the name of the process performing the operation
  • the operation itself
  • the path to the object the operation was performed on
  • (and most importantly) a detail column that tells you the good stuff

image

The best way to learn Process Monitor is by example, so let’s use it to collect SQL Server IO patterns on SharePoint databases while performing a full crawl in SharePoint (excluding writes to transaction logs). It will be interesting to see the range of IO request sizes during this time. To achieve this, we need to set up the filters for Procmon to give us just what we need…

First up,  choose “Filter…” from the Filter menu.

image

In the top left column, choose “Process Name” from the list of columns. Leave the condition field as “is” and click on the drop down next to it. It will enumerate the running processes, allowing you to scroll down and find sqlservr.exe.

image   image

Click OK and your newly minted filter will be added to the list (check out the green include filter below). Now we will only see operations performed by SQL Server in the Process Monitor display.

image

Rather than give you a dose of screenshot hell, I will not show each filter being added individually, as the process is much the same as what we just did to include only SQLSERVR.EXE. In all, we have to apply another four filters. The first two narrow the operations down to reads from and writes to the disk.

  • Filter on: Operation
  • Condition: Is
  • Filter applied: ReadFile
  • Filter type: Include
  • Filter on: Operation
  • Condition: Is
  • Filter applied: WriteFile
  • Filter type: Include

Now we need to specify the database(s) that we are interested in. On my test server, the SharePoint databases are on a disk array mounted as the D:\ drive. So I add the following filter:

  • Filter on: Path
  • Condition: Contains
  • Filter applied: D:\DATA\MSSQL
  • Filter type: Include

Finally, we want to exclude writes to transaction logs. Since all transaction logs write to files with an .LDF extension, we can use an exclusion rule:

  • Filter on: Path
  • Condition: Contains
  • Filter applied: LDF
  • Filter type: Exclude

Okay, so we have our filters set. Now widen the detail column that I mentioned earlier. If you have captured some entries, you should see the word “Length:” with a number next to it. This is the size of the IO request in bytes. Divide by 1024 if you want kilobytes (KB). Below you can see a range of 1.5KB to 32KB.

image

At this point you are all set. Go to SharePoint central administration and find the search service application. Start a full crawl and fairly quickly you should see matching disk IO operations displayed in Process Monitor. When the crawl is finished, you can stop capturing and save the resulting capture to file. Process Monitor supports CSV format, which makes it easy to import into Excel as shown below. (In the example below I created a formula for the column called “IO Size”.)

image
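Incidentally, if you would rather not click through central administration to kick off the crawl, the same thing can be done from the SharePoint 2010 Management Shell. A minimal sketch, assuming the default content source name:

    # Start a full crawl of the default content source
    $ssa = Get-SPEnterpriseSearchServiceApplication
    $cs  = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Identity "Local SharePoint sites"
    $cs.StartFullCrawl()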

By the way, in my quick analysis of the disk IO captured during part of a full crawl, I captured 329 requests that were broken down as follows:

  • 142 IO requests (42% of total) were 8KB in size, for a total of 1136KB
  • 48 IO requests (15% of total) were 16KB in size, for a total of 768KB
  • 48 IO requests (15% of total) were >16KB to 32KB in size, for a total of 1136KB
  • 49 IO requests (15% of total) were >32KB to 64KB in size, for a total of 2552KB
  • 22 IO requests (7% of total) were >64KB to 128KB in size, for a total of 2104KB
  • 20 IO requests (6% of total) were >128KB to 256KB in size, for a total of 3904KB
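If Excel is not your thing, a few lines of PowerShell can produce a similar breakdown straight from Procmon’s CSV export. This is a minimal sketch: it assumes the capture was saved as Logfile.CSV and that the Detail column contains the “Length: n” text described above.

    # Pull the IO size (in KB) out of the Detail column of a Procmon CSV export
    $sizes = Import-Csv .\Logfile.CSV | ForEach-Object {
        if ($_.Detail -match 'Length:\s*([\d,]+)') {
            [int]($Matches[1] -replace ',', '') / 1KB
        }
    }

    # Bucket the sizes in the same way as the breakdown above
    $sizes | Group-Object {
        if     ($_ -le 8)   { '8KB or less' }
        elseif ($_ -le 16)  { '>8KB to 16KB' }
        elseif ($_ -le 32)  { '>16KB to 32KB' }
        elseif ($_ -le 64)  { '>32KB to 64KB' }
        elseif ($_ -le 128) { '>64KB to 128KB' }
        else                { '>128KB' }
    } | Select-Object Count, Name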

Windows Performance Analyser (with a little help from Xperf123)

Allow me to introduce you to one of the best tools you never knew you needed. Windows Performance Analyser (WPA) is a newer addition to the armoury of tools for performance analysis and capacity planning. In short, it takes the idea of Windows Performance Monitor to a whole new level. WPA comes as part of a broader suite of tools collectively known as the Windows Performance Toolkit (WPT). Microsoft describes the toolkit as:

“…designed for analysis of a wide range of performance problems including application start times, boot issues, deferred procedure calls and interrupt activity (DPCs and ISRs), system responsiveness issues, application resource usage, and interrupt storms.”

If most of those terms sounded scary to you, then it should be clear that WPA is a pretty serious tool, with utility for many things going way beyond our narrow focus of disk performance. But never fear, BAs, I am not going to take a deep dive approach to explaining this tool. Instead I am going to outline the quickest and simplest way to leverage WPA for examining disk IO patterns. In fact, you should be able to follow what I outline here on your own PC if a SharePoint server is not conveniently located nearby.

Now this gem of a tool is not available as a separate download. It actually comes installed as part of the Microsoft Windows SDK for Windows 7 and .NET Framework 4. Admins fearing bloat on their servers can rest easy though, as you can choose just to install the WPT components as shown below…

image_thumb6_thumb

By default, the Windows Performance Toolkit will install its goodies into the “C:\Program Files\Microsoft Windows Performance Toolkit” folder. So go ahead and install it now, since it can be installed onto any version of Windows from Vista onwards. (I am sure that none of you at all are reading this article on an Apple device right? :-)

Now assuming you have successfully installed WPT, I want you to head on over to CodePlex and download a little tool called Xperf123 and save it into the toolkit folder above. Xperf123 is a 3rd party tool that hides a lot of the complexity of WPA and is thus a useful starting point. The only thing to bear in mind is that Xperf123 is not part of WPA and is therefore not a necessity. If your inner tech geek wants to get to know the WPA commands better, then I highly recommend a comprehensive article written by Microsoft’s Robert Smith and published back in Feb 2012. The article is called “Analysing Storage Performance using the Windows Performance Analysis Toolkit” and it is an outstanding piece of work in this area.

So we are all set. Let’s run the same test as we did with Procmon earlier: I will start a trace on my test SharePoint server, run a full crawl and then look at the resulting IO patterns. Perform the following steps in sequence.

  1. Start Xperf123 from the WPT installation folder (run it as administrator).
  2. At the initial screen, click Next and then Next again at the screen displaying operating system details
  3. From the Select Trace Type dropdown, choose Disk  I/O and press Next
  4. Ensure that “Enable Perfmon” and “Use Circular Logging” are ticked, and optionally choose “Specify Output Directory”. Press Next
  5. Leave “Stackwalk” unticked and choose Next

image     image

image  image

Alrightie then… we are all set! Click the Start Capture button to start collecting the good stuff! Xperf123 will run the actual WPA command-line trace utility (called xperf.exe if you really want to know). Now go to SharePoint central administration and, as we did with our Process Monitor test, start a full crawl. Wait till the crawl finishes and then in Xperf123, click the Stop Capture button. A trace file will be saved in the WPT installation folder (or wherever you specified). The naming convention is based on the server name and the date the trace was run.

image  image

image
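For the curious, what Xperf123 is doing under the hood boils down to a couple of xperf.exe calls. The exact kernel flags Xperf123 picks may differ, so treat this as an indicative sketch rather than a transcript of its behaviour.

    # Start a kernel trace that includes disk IO events (elevated prompt, WPT folder)
    xperf -on PROC_THREAD+LOADER+DISK_IO+DISK_IO_INIT

    # ...run the full crawl while the trace is active...

    # Stop the trace and merge the results into an .etl file for analysis
    xperf -d C:\Traces\FullCrawl.etl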

Okay, so capturing the trace was positively easy. How about analysing it visually?

Double click on the newly minted trace file and it will be loaded into the Performance Analyser analysis tool (also available from the Start menu). When the tool loads and processes the trace file, you will see CPU and Disk IO counts reported visually. The CPU is the line graph and the IO counts are represented as a bar graph. Unlike Windows Performance Monitor, which we covered in Part 7, this tool has a much better ability to drill down into the details.

If you look closely below, there are two “flyout” arrows. One is on the left side in the middle of the screen and applies to all graphs; the other is on the top-right of each graph. If you click them, you are able to apply filters to what information is displayed. For example, if you click the “IO Counts” flyout, you can filter which types of IO you want to see. Looking at the screenshot below, you can see that the majority of what was captured were disk writes (the blue bars below).

image  image

Okay, so let’s get a closer look at what is going on disk-wise. Right click somewhere on the Disk IO bar graph and choose “Detail Graph” from the menu.

image

Now we have a very different view. On the left we can see which disk we are looking at and on the right we can see detailed performance stats for that disk. It may not be clear from the screenshot, but the disk IO reported below is broken down by process. This detailed view also has flyouts and dropdowns that allow you to filter what you see. There is an upper-left dropdown menu under the section called “Physical Disk”, which allows you to change which disk you are interested in. On the upper right, there is a flyout labelled “Process Name”. Clicking this allows you to filter the display to only view a subset of the processes that were running at the time the trace was captured.

image

Now in my case, I only want to see the SQL Database activity, so I will make use of the aforementioned filtering capability. Below I show where I selected the disk where the database files reside and on the right I deselected all processes apart from SQLSERVR.EXE. Neat huh? Now we are looking at the graph of every individual IO operation performed during the time displayed and you can even hover over each dot to get more detail of the IO operation performed.

image

You can also zoom in with great granularity. Simply select a time period by dragging the mouse pointer over the graph area, then right click the selected time period and choose “Zoom to Selection”. Cool eh? If your mouse has a wheel button, you can also zoom in and out by pressing the Ctrl key and rolling the mouse wheel.

image

Now even for the most wussy non-technical BA reading this, surely your inner nerd is liking what you see. But why stop here? After all, Process Monitor gave us lots more loving detail and the ability to utilise sophisticated filtering. So how does WPA stack up?

To answer this question, try these steps: right click on the detail graph and this time choose “Summary Table”. This allows us to view even more detail of the IO data.

image

image

Voila! We now have a list of every IO transaction performed during the sample period. Each line in the summary table represents a single I/O operation. The columns are movable and sortable as well. On that note, some of the more interesting ones for our purposes include (thanks to Robert Smith for the explanation of these):

  • IO Type: Read, Write, or Flush
  • Complete Time: Time of I/O completion in milliseconds, relative to start and stop of the current trace.
  • IO Time: The amount of time in milliseconds the I/O took to complete
  • Disk Service Time: The inferred amount of time (in microseconds) the IO operation has spent on the device (this one has caveats, check Robert Smiths post for detail).
  • QD/I: Queue depth of the disk, irrespective of partitions, at the time this I/O request initialized
  • IO Size: Size of this I/O, in bytes.
  • Process Name: The name of the process that initiated this I/O.
  • Path: Path and file name, if known, that is the target of this I/O (in plain English, this essentially means the file name).

I have a lot of IO requests in this summary view, so let’s see how this baby can filter. First up, let’s look only at IO that was initiated by SQL Server. Right click on the “Process Name” column and choose “Filter To” –> “Search on Column…”. In the resulting window, enter “SQLSERVR.EXE” in the “Find what:” textbox. Double check that the column name is set to “Process Name” in the dropdown and click Filter.

image  image

You should now see only SQL IO traffic. So let’s drill down further still. This time I want to exclude IO transactions that are transaction log related. To do this, right click on the “Path Name” column and choose “Filter To” –> “Search on Column…” In the resulting window, enter “MDF” in the “Find what:” textbox. Double check that the column name is set to “Path name” in the dropdown and click Filter.

image image

Can you guess the effect? Only SQL Server database files will be displayed since they typically have a file extension of MDF.

In the screenshot below, I have then used the column sorting capability to look at the IO sizes. Neat huh?

image

Don’t forget Performance Monitor…

Just before we are done with the Windows Performance Analysis Toolkit, cast your mind back to the start of this walkthrough when we used Xperf123 to generate the trace. You might recall there was a tickbox in the Xperf123 wizard called “Enable Perfmon”. Well, it turns out that Xperf123 has one final perk: while the WPA trace was being made, a Perfmon trace of broader system performance was captured at the same time. These logs are located in the C:\PerfLogs\ directory and are saved in the native Windows Performance Monitor format. So just double click the file and watch the love…

image

How’s that for a handy added bonus? It is also worth mentioning that the captured Perfmon trace has a significant number of performance counters in the categories of Memory, PhysicalDisk, Processor and System.

Conclusion and coming next…

Well! That was a long post, but that was more because of verbose screenshots than anything else.

Both Process Monitor and Windows Performance Analyser are very useful tools for developing a better understanding of disk IO patterns. While Procmon has more sophisticated filtering capabilities, WPA trumps Procmon in terms of reduced overhead (apparently 20,000 events per second amounts to less than 2% CPU on a 2.0 GHz processor). WPA also has the ability to visualise and drill down into the data better than Procmon can.

Nevertheless, both tools have far more utility beyond the basic scenarios outlined in this series and are definitely worth investigating more.

In the next and I suspect final post, I will round off this examination of performance by making a few more general SharePoint performance recommendations and outlining a lightweight methodology that you can use for your own assessments.

Until then, thanks for reading…

Paul Culmsee

www.hereticsguidebooks.com


