Demystifying SharePoint Performance Management Part 11 – Tales from the Microsoft labs


Hi all and welcome to the final article in my series on SharePoint performance management – for now anyway. Once SharePoint 2013 goes RTM, I might revisit this topic if it makes sense to, but some other blogging topics have caught my attention.

To recap the entire journey, the scene was set in part 1 with the distinction of lead and lag indicators. In part 2, we then examined Requests per Second (RPS) and looked at its strengths and weaknesses as a performance metric. From there, we spent part 3 looking at how to leverage RPS via the Log Parser utility and a little PowerShell goodness. Part 4 rounded off our examination of RPS by delving deeper into utilising Log Parser to eke out interesting RPS related performance metrics. We also covered the very excellent SharePoint Flavored Weblog Reader utility, which saves a bunch of work and can give some terrific insights. Part 5 switched tack into the wonderful world of latency, and in particular, focused on disk latency. Part 6 then introduced the disk performance metrics of IOPS and MBPS and their relationship to latency. We also looked at typical SharePoint and SQL Server disk IO characteristics and then examined the pros and cons of RPS, IOPS, latency and MBPS and how they all relate to each other. In part 7 and continuing into part 8, we introduced the performance monitor counters that give us insight into these metrics, as well as the SQLIO utility for stress testing disk infrastructure. This set the scene for part 9, where we took a critical look at Microsoft’s own real-world findings to help us understand what suitable figures would be. The last post then introduced a couple of other excellent tools, namely Process Monitor and the Windows Performance Analysis Toolkit, that should be in your arsenal for SharePoint performance work.

In this final article, we will tie up a few loose ends.

Insights from Microsoft labs testing…

In part 9 of this series, I examined Microsoft’s performance figures reported from their production SharePoint 2010 deployments. This information comes from the oft mentioned SharePoint 2010 Capacity Planning Guide. Microsoft are a large company and they have four different SharePoint farms for different collaborative scenarios. To recap, those scenarios were:

  1. Enterprise Intranet environment (also described as published intranet). In this scenario, employees view content like news, technical articles, employee profiles, documentation, and training resources. It is also the place where all search queries are performed for all of the other SharePoint environments within the company. Performance reported for this environment was 33580 unique users per day, with an average of 172 concurrent users and a peak concurrency of 376 users.
  2. Enterprise intranet collaboration environment (also described as intranet collaboration). This is where important team sites and publishing portals are housed. Sites created in this environment are used as communication portals, applications for business solutions, and general collaboration. Performance reported for this environment was double the first environment with 69702 unique users per day. Concurrency was more than double, with an average of 420 concurrent users and a peak concurrency of 1433 users.
  3. Departmental collaboration environment. In this scenario, employees use this environment to track projects, collaborate on documents, and share information within their department. Performance reported for this environment was a much lower figure of 9186 unique users per day (which makes sense given it is departmental stuff). Nevertheless, concurrency was similar to the enterprise intranet scenario with an average of 189 concurrent users and a peak concurrency of 322 users.
  4. Social collaboration environment. This is Microsoft’s My Sites scenario, connecting employees with one another and presenting personal information such as areas of expertise, past projects, and colleagues to the wider organization. This included personal sites and documents for collaboration. Performance reported for this environment was 69814 unique users per day, with an average of 639 concurrent users and a peak concurrency of 1186 users.

Presented as a table, we have the following rankings:

Scenario         Social Collaboration   Enterprise Intranet Collaboration   Enterprise Intranet   Departmental Collaboration
Unique Users     69814                  69702                               33580                 9186
Avg Concurrent   639                    420                                 172                   189
Peak Concurrent  1186                   1433                                376                   322

When you think about it, the performance information reported for these scenarios is lag indicator based. That is, they are real-world performance statistics from pre-existing deployments. Thus, while we can utilise the above figures for some insights into estimating the performance needs of our own SharePoint environments, they lack important detail. For example: in each scenario above, while the SharePoint farm topology was specified, we have no visibility into how these environments were scaled out to meet performance needs.
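
Still, one quick way to turn these lag indicators into something reusable is to work out the concurrency ratios implied by Microsoft's own numbers. Below is a minimal PowerShell sketch that does just that; the figures are taken straight from the table above and nothing else is assumed.

# Microsoft's reported production figures (from the table above)
$scenarios = @(
    @{Name="Social Collaboration";              Unique=69814; Avg=639; Peak=1186},
    @{Name="Enterprise Intranet Collaboration"; Unique=69702; Avg=420; Peak=1433},
    @{Name="Enterprise Intranet";               Unique=33580; Avg=172; Peak=376},
    @{Name="Departmental Collaboration";        Unique=9186;  Avg=189; Peak=322}
)
foreach ($s in $scenarios) {
    # Express concurrency as a percentage of unique users per day
    $avgPct  = [math]::Round(($s.Avg  / $s.Unique) * 100, 1)
    $peakPct = [math]::Round(($s.Peak / $s.Unique) * 100, 1)
    "{0}: average concurrency {1}% of unique users, peak {2}%" -f $s.Name, $avgPct, $peakPct
}

Running this shows peak concurrency sitting at roughly 1-4% of daily unique users across all four scenarios, which is a handy sanity check the next time someone asks you to design for "100% concurrency".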

Some lead indicator perspective…

Luckily for us, Microsoft did more than simply report on the performance of the above four collaboration scenarios. For two of the scenarios Microsoft created test labs and published performance results with different SharePoint farm topologies. This is really handy indeed, because it paints a much better lead indicator scenario. We get to see what bottlenecks occurred as the load on the farm was increased. We also get insight about what Microsoft did to alleviate the bottlenecks and what sort of a difference it made.

One lab was based off Microsoft’s own Departmental collaboration environment (the 3rd scenario above) and is covered on pages 144-162 of the capacity planning guide. The other was based off the Enterprise Intranet Collaboration environment (the 2nd scenario above) and is the focus of attention on pages 174-197. Consult the guide for the full detail of the tests; what follows is just a quick synthesis.

Lab 1: Enterprise Intranet Collaboration Environment

In this lab, Microsoft took a subset of the data from their production environment and ran it on different hardware. They acknowledge that the test results will be affected by this, but in my view it is not a show stopper if you take a lead indicator viewpoint. Microsoft tested web server scale out by initially starting with a three server topology (one web front end server, one application server and one database server). They then increased the load on the farm until they reached a saturation point. Once this happened, they added an additional web server to see what would happen. This was repeated, scaling from one Web server (1x1x1) to five Web servers (5x1x1).

The transactional mix used for this testing was based on the breakdown of transactions from the live system. Little indication of read vs. write transactions is given in the case study, but on page 152 there is a detailed breakdown of SharePoint traffic by type. While I won’t detail everything here, regular old browser traffic was the most common, representing 36% of all test traffic. WebDAV came in second (WebDAV typically includes Office clients and Windows Explorer view), representing 28.12% of traffic, and Outlook sync traffic came third at 7.04%.

Below is a table showing the figures where things bottlenecked. Microsoft produce many graphs in their documentation so the figures below are an approximation based on my reading of them. It is also important to note that Microsoft did not perform tests while search was running, and compensated for search overhead by defining a max CPU limit for SQL Server of 80%.

Topology         1*1*1   2*1*1   3*1*1   4*1*1   5*1*1
Max RPS          180     330     510     560     565
Sustainable RPS  115     210     305     390     380
Latency          .3      .2      .35     .2      .2
IOPS             460     710     910     920     840
WFE CPU          96%     89%     89%     76%     58%
SQL CPU          17%     33%     65%     78%     79%

For what it’s worth, the sustainable RPS figure is based on the servers not being stressed (all servers having less than 50% CPU). Looking at the above figures, several things are apparent.

  1. The environment scaled up to four Web servers before the bottleneck changed to be CPU usage on the database server.
  2. Once database server CPU hit its limits, RPS on the web servers suffered. Note that the RPS gain from 4*1*1 to 5*1*1 is negligible because SQL CPU was saturated.
  3. The addition of the fourth Web server had the least impact on scalability compared to the preceding three (RPS only increased from 510 to 560, which is much less than adding the previous web servers). This suggests the SQL bottleneck hit somewhere between 3 and 4 web servers.
  4. The average latency was almost constant throughout the whole test, unaffected by the number of Web servers and throughput. This suggests that we never hit any disk IO bottlenecks.

Once Microsoft identified the point at which database server CPU was the bottleneck (4*1*1), they added an additional database server and then kept adding web servers as they did previously. They split half the content databases onto one SQL server and half onto the other. It is important to note that the underlying disk infrastructure was unchanged, meaning that total disk performance capability was kept constant even though there were now two database servers. This allowed Microsoft to isolate server capability from disk capability. Here is what happened:

Topology   4*1*1   4*1*2   6*1*2   8*1*2
RPS        560     660     890     930
Latency    .2      .35     .2      .2
IOPS       910     1100    1350    1330
WFE CPU    76%     87%     78%     61%
SQL CPU    78%     33%     52%     58%

Here is what we can glean from these figures.

  1. Adding a second database server did not provide much additional RPS (560 to 660). This is because CPU utilization on the Web servers was high. In effect, the bottleneck shifted back to the web front end servers.
  2. With two database servers and eight web servers (8*1*2), the bottleneck became the disk infrastructure. (Note that IOPS at 8*1*2 is no better than at 6*1*2.)

So what can we conclude? From the figures shown above, it appears that you could reasonably expect (remember we are talking lead indicators here) that bottlenecks are likely to occur in the following order:

  1. Web Server CPU
  2. Database Server CPU
  3. Disk IOPS

It would be a stretch to suggest when each of these would happen because there are far too many variables to consider. But let’s now examine the second lab case study to see if this pattern is consistent.

Lab 2: Divisional Portal Environment

In this lab, Microsoft took a different approach from the lab we just examined. This time they did not concern themselves with IOPS (“we did not consider disk I/O as a limiting factor. It is assumed that an infinite number of spindles are available”). The aim this time was to determine at what point a SQL Server CPU bottleneck was encountered. Based on what I have noted from the first lab test above, unless your disk infrastructure is particularly crap, SQL Server CPU should become a bottleneck before IOPS. However, one thing in common with the last lab test was that Microsoft factored in the effects of an ongoing search crawl by assuming 80% SQL Server CPU as the bottleneck indicator.

Much more detail was provided on the transaction breakdown for this lab. Pages 181 and 182 list transactions by type and, unlike the first lab, whether they are read or write. While it is hard to directly compare to lab 1, it appears that more traffic is oriented around document collaboration than in the first lab.

The basic methodology was to start off with a minimal farm configuration of a combined web/application server and one database server. Through multiple iterations, the test ended with a configuration of three Web servers, one application server and one database server. The table of results is below:

Topology         1*1    1*1*1   2*1*1   3*1*1
RPS              101    150     318     310
Sustainable RPS  75     99      191     242
Latency          .81    .85     .6      .8
Users simulated  125    150     200     226
WFE CPU          86%    36%     76%     42%
APP CPU          NA     41%     46%     44%
SQL CPU          18%    32%     56%     75%

Here is what we can glean from these figures.

  1. Web Server CPU was the first bottleneck encountered.
  2. At a 3*1*1 configuration, SQL Server CPU became the bottleneck. In lab 1 it was somewhere between the 3rd and 4th web server.
  3. RPS, when CPU is taken into account, is fairly similar between each lab. For example, in the first lab, the 2*1*1 scenario RPS was 330. In this lab it was 318 and both had comparable CPU usage. The smallest configurations had differing results (101 vs. 180), but if you adjust for the reported CPU usage, things even out.
  4. With each additional Web server, the increase in RPS was almost linear. We can extrapolate that as long as SQL Server is not bottlenecked, you can add more Web servers and a further increase in RPS is possible.
  5. Latencies are not affected much as we approached the bottleneck on SQL Server. Once again, the disk subsystem was never stressed.
  6. The previous assertion that bottlenecks are likely to occur in the order of Web Server CPU, Database Server CPU and then Disk subsystem appears to hold true.

Before we go any further, one important point that I have neglected to mention so far is that the load levels behind the figures above are extremely undesirable. Do you really want your web server and database server to be running at 85% CPU constantly? I think not. What you are seeing above are the upper limits, based on Microsoft’s testing. While this helps us understand maximum theoretical capacity, it does not make for a particularly scalable environment.

To account for the issue of reporting on max load, Microsoft defined what they termed a “green zone” of performance. This is a term to describe what “normal” load conditions look like (for example, less than 50% CPU), and they also provided RPS results for when the servers were in that zone. If you look closely at the tables above you will see those RPS figures; I labelled them “Sustainable RPS”.

In case you are wondering, the sustainable RPS for each of the scenarios works out at somewhere between roughly 60% and 80% of the peak RPS reported.
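
If you want to check that for yourself, the ratio is trivial to compute from the tables above. A minimal sketch follows (the figures are my approximations of Microsoft’s graphs, as noted earlier):

# Sustainable ("green zone") RPS versus peak RPS, using figures from the lab tables above
$results = @(
    @{Topology="1*1*1 (lab 1)"; Peak=180; Sustainable=115},
    @{Topology="3*1*1 (lab 1)"; Peak=510; Sustainable=305},
    @{Topology="5*1*1 (lab 1)"; Peak=565; Sustainable=380},
    @{Topology="1*1 (lab 2)";   Peak=101; Sustainable=75},
    @{Topology="3*1*1 (lab 2)"; Peak=310; Sustainable=242}
)
foreach ($r in $results) {
    $pct = [math]::Round(($r.Sustainable / $r.Peak) * 100)
    "{0}: sustainable RPS is {1}% of peak" -f $r.Topology, $pct
}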

Some Microsoft math…

In the second lab, Microsoft offers some advice on how to translate their results into our own deployments. They suggest determining a users to RPS ratio and then utilising the green zone RPS figures to estimate server requirements. This is best illustrated via their own example from lab 2. They state the following:

  • A divisional portal at Microsoft which supports around 8000 employees collaborating heavily experiences an average RPS of 110.
  • That gives a users to RPS ratio of ~72 (that is, 8000/110). In other words, 72 users amount to 1 RPS.
  • Using this ratio and assuming the sustainable RPS figures from lab 2 results, Microsoft created the following table (page 196) to suggest the number of users a typical deployment might support.

[Table from page 196 of the capacity planning guide: the number of users each farm topology should support, derived from the green zone RPS figures and the users to RPS ratio]
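
To make Microsoft’s math a bit more concrete, here is a minimal sketch of the same calculation applied to your own numbers. The user count, measured average RPS and green zone RPS below are placeholders (Microsoft’s own example figures); substitute your own measurements and the sustainable RPS figures from the lab tables above.

# Step 1: derive a users to RPS ratio from your own measurements
$totalUsers     = 8000   # placeholder: your total user base
$measuredAvgRps = 110    # placeholder: average RPS from your IIS logs
$usersPerRps    = $totalUsers / $measuredAvgRps       # ~72 in Microsoft's example

# Step 2: estimate how many users a given topology could support,
# using a sustainable ("green zone") RPS figure from the lab results
$greenZoneRps   = 242    # e.g. the 3*1*1 sustainable RPS from lab 2
$estimatedUsers = [math]::Floor($greenZoneRps * $usersPerRps)

"Users per 1 RPS: {0:N0}" -f $usersPerRps
"A topology sustaining {0} RPS could support roughly {1:N0} users with a similar usage profile" -f $greenZoneRps, $estimatedUsers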

A basic performance planning methodology…

Okay.. so I am done… I have no more topics that I want to cover (although I could go on forever on this stuff). Hopefully I have laid out enough conceptual scaffolding to allow you to read Microsoft’s large and complex SharePoint performance and capacity planning documentation with more clarity than before. If I were to sum up a few of the key points of this 11 part exploration into the weird and wonderful world of SharePoint performance management, it would be as follows:

  1. Part 1: Think of performance in terms of lead and lag indicators. You will have less of a brain fart when trying to make sense of everything.
  2. Part 2: Requests are often confused with transactions. A transaction (eg “save this document”) usually consists of multiple requests and the number of requests is not an indicator of performance. Look to RPS to help here…
  3. Part 3 and 4: The key to utilising RPS is to understand that as a counter on its own, it is not overly helpful. BUT it is the one metric that you probably have available in lots of detail, due to it being captured in web server logs over time. Use it to understand usage patterns of your sites and portals and determine peak usage and concurrent usage.
  4. Part 5: Latency (and disk latency in particular) is both unavoidable, yet one of the most common root causes of performance issues. Understanding it is critical.
  5. Part 6: Disk latency affects – and is affected by – IOPS, IO size and IO patterns. Focusing on one without the others is quite pointless. They all affect each other, so watch out when they are specified in isolation (i.e. “5000 IOPS”).
  6. Part 6, 7 and 8: Latency and IOPS are handy in the sense that they can be easily simulated and are therefore useful lead indicators. Test all SQL IO scenarios at 8KB and 64KB IO sizes and ensure they meet your latency requirements (see the sketch just after this list).
  7. Part 9: Give your SAN dudes a specified IOPS, IO Size and latency target. Let them figure out the disk configuration that is needed to accommodate. If they can make your target then focus on other bottleneck areas.
  8. Part 10: Process Monitor and Windows Performance Analyser are brilliant tools for understanding disk IO patterns (among other things)
  9. Part 9 and 11: Don’t believe everything you read. Utilise Microsoft’s real world and lab results as a guide but validate expected behaviour by testing your own environment and look for gaps between what is expected and what you get.
  10. Part 11: In general, Web Server CPU will bottleneck first, followed by SQL Server CPU. If you followed the advice of points 6 and 7 above, then disk shouldn’t  be a problem.
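
On point 6, here is roughly what that testing looks like in practice. Treat it as a hedged sketch only: the install path, drive letter, test file, duration, thread and outstanding IO counts are placeholders you would tune for your own environment, and the switches should be double-checked against the SQLIO documentation covered back in part 8.

# Random 8KB IO - typical of SQL Server data file access patterns
& "C:\Program Files (x86)\SQLIO\sqlio.exe" -kR -s300 -frandom -b8 -o8 -t4 -LS -BN D:\testfile.dat

# Sequential 64KB IO - typical of larger transfers such as read-ahead and backup traffic
& "C:\Program Files (x86)\SQLIO\sqlio.exe" -kW -s300 -fsequential -b64 -o8 -t4 -LS -BN D:\testfile.dat

Compare the average and histogram latency figures SQLIO reports against the latency targets you agreed with your storage team (per point 7).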

Now I promised at the very start of this series, that I would provide you with a lightweight methodology for estimating SharePoint performance requirements. So assuming you have read this entire series and understand the implications, here goes nothing…

If they can meet this target (the IOPS, IO size and latency figures you gave your storage team), skip to step 8. 🙂

If they cannot meet this, don’t worry because there are two benefits gained already. First, by finding that they cannot get near the above figures, they will do some optimisation like testing different stripe sizes and checking some other common disk performance hiccups. This means they now better understand the disk performance patterns and are thinking in terms of lead indicators. The second benefit is that you can avoid tedious, detailed discussions on what RAID level to go with up front.

So while all of this is happening, do some more recon…

  • 4. Examine Microsoft and HP’s testing results that I covered in part 9 and in this article. Pay particular attention to the concurrent users and RPS figures. Also note the IOPS results from Microsoft and HP testing. To remind you, no test ever came in over 1400 IOPS.
  • 5. Use Log Parser to examine your own logs to understand usage patterns (see the sketch just after this list). Not only should you eke out metrics like max concurrent users and RPS figures, but examine peak times of the day, RPS growth rate over time, and what client applications or devices are being used to access your portal or intranet.
  • 6. Compare your peak and concurrent usage stats to Microsoft and HP’s findings. Are you half their size, double their size? This can give you some insight into a lower IOPS target to use. If you have 200 simultaneous users, then you can derive a more modest target IOPS for your storage guys to meet, in line with your own organisation’s size and make-up.
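
As promised in point 5, here is a hedged sketch of the sort of Log Parser query I mean. It assumes Log Parser 2.2 is installed in its default location and that your IIS logs are in the standard W3C format with the usual u_ex*.log naming; adjust the paths to suit your environment.

# Requests per minute (and the average RPS within each minute) from IIS logs, busiest minutes first
$logparser = "C:\Program Files (x86)\Log Parser 2.2\LogParser.exe"
$query = @"
SELECT QUANTIZE(TO_TIMESTAMP(date, time), 60) AS Minute,
       COUNT(*) AS Requests,
       DIV(COUNT(*), 60) AS AvgRPS
FROM C:\inetpub\logs\LogFiles\W3SVC1\u_ex*.log
GROUP BY Minute
ORDER BY Requests DESC
"@
& $logparser -i:W3C -o:CSV $query

The concurrency and per-user breakdowns covered in parts 3 and 4 follow the same pattern; vary the SELECT and GROUP BY clauses (for example, grouping by cs-username or cs(User-Agent)) to see who and what is generating the load.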

By now the storage guys will come back to you in shock because they cannot get near your 5000 IOPS requirement. Be warned though… they might ask you to sign a cheque to upgrade the storage subsystem to meet this requirement. It won’t be coming out of their budget for sure!

  • 7. Tell them to slowly reduce the IOPS until they hit the 8ms and 1ms latency targets and give them the revised target based on the calculation you made in step 6. If they still cannot make this, then sign the damn cheque!

At this point we have assumed that there is enough assurance that the disk infrastructure is sound. Now it’s all about CPU and memory.

  • 8. Determine a users to RPS ratio by dividing your total user base by average RPS (based on your findings from step 5).
  • 9.  Look at Microsoft’s published table (page 196 of the capacity planning guide and reproduced here just above this conclusion). See what it suggests for the minimum topology that should be needed for deployment.
  • 10. Use that as a baseline and now start to consider redundancy, load balancing and all of that other fun stuff.

And there you have it! My 10 step dodgy performance analysis method. 🙂

Conclusion and where to go next…

Right! Although I am done with this topic area, there are some next steps to consider.

Remember that this entire series is predicated on the notion that you are in the planning stage. Let’s say you have come up with a suggested topology, deployed the hardware and developed your SharePoint masterpiece. The job of ensuring that performance meets expectations does not stop here. You still should consider load testing to ensure that the deployed topology meets expectations and validates the lead indicators. There is also a seemingly endless number of optimisations that can be done within SharePoint too, such as caching to reduce SQL Server load or tuning web application or service application settings.

But for now, I hope that this series has met my stated goal of making this topic area that little bit more accessible, and thank you so much for taking the time to read it all.

 

Paul Culmsee

www.hereticsguidebooks.com

www.sevensigma.com.au


Demystifying SharePoint Performance Management Part 10 – More tools of the trade…


Hi all and welcome to the tenth article in my series on demystifying SharePoint performance management. I do feel that we are getting toward the home stretch here. If you go way back to Part 1, I stated my intent to highlight some common misconceptions and traps for younger players in the area of SharePoint performance management, while demonstrating a better way to think about measuring SharePoint performance (i.e. lead and lag indicators). While doing so, we examined the common performance indicators of RPS, IOPS, MBPS, latency and the tools and approaches to measuring and using them.

I started the series praising some of Microsoft’s material, namely the “Planning guide for server farms and environments for Microsoft SharePoint Server 2010”, “Capacity Planning for Microsoft SharePoint Server 2010” and “Analysing Microsoft SharePoint Products and Technologies Usage” guides. But they are not perfect by any stretch, and in the last post, I covered some of the inconsistencies and questionable information that does exist in the capacity planning guide in particular. Not only are some of the disk performance figures quoted given without the critical context needed to understand how to measure them in a meaningful way, some of the figures themselves are highly questionable.

I therefore concluded Part 9 by advising readers not to believe everything presented and always verify espoused reality with actual reality via testing and measurement.

Along the journey that we have undertaken, we have examined some of the tools that are available to perform such testing and measurement. So far, we have used Log Parser, the SharePoint Flavored Weblog Reader, Windows Performance Monitor, SQLIO and the odd bit of PowerShell thrown in for good measure. This article will round things out by showing you two additional tools to verify theoretical fiction with cold hard reality. Both of these tools allow you to get a really good sense of IO patterns in particular (although they both have many other purposes). The first will be familiar to my more nerdy readers; the second is highly powerful, but much less known to newbies and seasoned IT pros alike.

So without further ado, let’s meet our tools… Process Monitor and the Windows Performance Analysis Toolkit.

Process Monitor

Our first tool is Process Monitor, also commonly known as Procmon. Now this tool is quite well known, so I will not be particularly verbose with my examination of it. But for the three of you who have never heard of this tool, Process Monitor allows us to (among many other things) monitor accesses to the file system by processes running on a server. This allows us to get a really low level view of IO requests as they happen in real time. What is really nice about Process Monitor is its granularity. It allows you to set up some sophisticated filtering that lets you really see the wood for the trees. For example, one can create fairly elaborate filters that show just the details of a specific SQL database. Also handy is that all collected data can be saved to file for later examination.

When you start Process Monitor, you will see a screen something like the one below, and it will immediately start collecting data about various operations (there are around 140 monitorable operations covering file system, registry, process, network and kernel activity). By default it monitors the file system, registry and processes. The default columns that are displayed include:

  • the name of the process performing the operation
  • the operation itself
  • the path to the object the operation was performed on
  • and most importantly, a detail column that tells you the good stuff.


The best way to learn Process Monitor is by example, so let’s use it to collect SQL Server IO patterns on SharePoint databases while performing a full crawl in SharePoint (while excluding writes to transaction logs). It will be interesting to see the range of IO request sizes during this time. To achieve this, we need to set up the filters for Procmon to give us just what we need…

First up, choose “Filter…” from the Filter menu.


In the top left column, choose “Process Name” from the list of columns. Leave the condition field as “is” and click on the drop down next to it. It will enumerate the running processes, allowing you to scroll down and find sqlservr.exe.


Click OK and your newly minted filter will be added to the list (check out the green include filter below). Now we will only see operations performed by SQL Server in the Process Monitor display.


Rather than give you a dose of screenshot hell, I will not individually show you how to add each filter, as the process is similar to what we just did to include only SQLSERVR.EXE. In all, we have to apply another 5 filters. The first two filter the operations down to reading from and writing to the disk.

  • Filter on: Operation
  • Condition: Is
  • Filter applied: ReadFile
  • Filter type: Include
  • Filter on: Operation
  • Condition: Is
  • Filter applied: WriteFile
  • Filter type: Include

Now we need to specify the database(s) that we are interested in. On my test server, SharePoint databases  are on a disk array mounted as D:\ drive. So I add the following filter:

  • Filter on: Path
  • Condition: Contains
  • Filter applied: D:\DATA\MSSQL
  • Filter type: Include

Finally, we want to exclude writes to transaction logs. Since all transaction logs write to files with an .LDF extension, we can use an exclusion rule:

  • Filter on: Path
  • Condition: Contains
  • Filter applied: LDF
  • Filter type: Exclude

Okay, so we have our filters set. Now widen the detail column that I mentioned earlier. If you have captured some entries, you should see the word “Length :” with a number next to it. This is reporting the size of the IO request in bytes. Divide by 1024 if you want to get to kilobytes (KB). Below you can see a range of 1.5KB to 32KB.


At this point you are all set. Go to SharePoint central administration and find the search service application. Start a full crawl and fairly quickly, you should see matching disk IO operations displayed in Process Monitor. When the crawl is finished, you can choose to stop capturing and save the resulting capture to file. Process Monitor supports CSV format, which makes it easy to import into Excel as shown below (in the example below I created a formula for a column called “IO Size”).


By the way, in my quick test analysis of disk IO during part of a full crawl, I captured 329 requests that were broken down as follows:

  • 142 IO requests (42% of total) were 8KB in size for a total of 1136KB
  • 48 IO requests (15% of total) were 16KB in size for a total of 768KB
  • 48 IO requests (15% of total) were >16KB to 32KB in size for a total of 1136KB
  • 49 IO requests (15% of total) were >32KB to 64KB in size for a total of 2552KB
  • 22 IO requests (7% of total) were >64KB to 128KB in size for a total of 2104KB
  • 20 IO requests (6% of total) were >128KB to 256KB in size for a total of 3904KB
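
If you want to reproduce that sort of breakdown without hand-cranking Excel formulas, a little PowerShell over the CSV export does the job. This is a sketch only: it assumes the capture was saved as C:\temp\crawl.csv and that the IO size appears as "Length: n" inside the Detail column, as described above.

# Pull the IO size out of Procmon's Detail column and bucket the requests by size
$sizesKB = Import-Csv C:\temp\crawl.csv | ForEach-Object {
    if ($_.Detail -match 'Length:\s*([\d,]+)') {
        ([double](($Matches[1]) -replace ',', '')) / 1KB   # bytes -> KB
    }
}
$buckets = @(
    @{Label="<=8KB";       Match={ $_ -le 8 }},
    @{Label=">8KB-16KB";   Match={ $_ -gt 8 -and $_ -le 16 }},
    @{Label=">16KB-32KB";  Match={ $_ -gt 16 -and $_ -le 32 }},
    @{Label=">32KB-64KB";  Match={ $_ -gt 32 -and $_ -le 64 }},
    @{Label=">64KB-128KB"; Match={ $_ -gt 64 -and $_ -le 128 }},
    @{Label=">128KB";      Match={ $_ -gt 128 }}
)
foreach ($b in $buckets) {
    $hits = @($sizesKB | Where-Object $b.Match)
    "{0,-12} {1,4} requests, {2,8:N0} KB total" -f $b.Label, $hits.Count, ($hits | Measure-Object -Sum).Sum
}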

Windows Performance Analyser (with a little help from Xperf123)

Allow me to introduce you to one of the best tools you never knew you needed. Windows Performance Analyser (WPA) is a newer addition to the armoury of tools for performance analysis and capacity planning. In short, it takes the idea of Windows Performance Monitor to a whole new level. WPA comes as part of a broader suite of tools collectively known as the Windows Performance Toolkit (WPT). Microsoft describes the toolkit as:

“…designed for analysis of a wide range of performance problems including application start times, boot issues, deferred procedure calls and interrupt activity (DPCs and ISRs), system responsiveness issues, application resource usage, and interrupt storms.”

If most of those terms sounded scary to you, then it should be clear that WPA is a pretty serious tool and has utility for many things, going way beyond our narrow focus of Disk performance. But never fear BA’s, I am not going to take a deep dive approach to explaining this tool. Instead I am going to outline the quickest and simplest way to leverage WPA for examining disk IO patterns. In fact, you should be able to follow what I outline here on your own PC if SharePoint is not conveniently located nearby.

Now this gem of a tool is not available as a separate download. It actually comes installed as part of the Microsoft Windows SDK for Windows 7 and .NET Framework 4. Admins fearing bloat on their servers can rest easy though, as you can choose just to install the WPT components as shown below…


By default, the Windows Performance Toolkit will install its goodies into the “C:\Program Files\Microsoft Windows Performance Toolkit” folder. So go ahead and install it now, since it can be installed onto any version of Windows from Vista onward. (I am sure that none of you at all are reading this article on an Apple device right? :-).

Assuming you have successfully installed WPT, I now want you to head on over to CodePlex and download a little tool called Xperf123 and save it into the toolkit folder above. Xperf123 is a 3rd party tool that hides a lot of the complexity of WPA and thus is a useful starting point. The only thing to bear in mind is that Xperf123 is not part of WPA and is therefore not a necessity. If your inner tech geek wants to get to know the WPA commands better, then I highly recommend a comprehensive article written by Microsoft’s Robert Smith and published back in Feb 2012. The article is called “Analysing Storage Performance using the Windows Performance Analysis Toolkit” and it is an outstanding piece of work in this area.

So we are all set. Let’s run the same test as we did with Procmon earlier. I will start a trace on my test SharePoint server, run a full crawl and then look at the resulting IO patterns. Perform the following steps in sequence:

  1. Start Xperf123 from the WPT installation folder (run it as administrator).
  2. At the initial screen, click Next and then Next again at the screen displaying operating system details
  3. From the Select Trace Type dropdown, choose Disk  I/O and press Next
  4. Ensure that “Enable Perfmon” and “Use Circular Logging” are ticked, and optionally choose “Specify Output Directory”. Press Next
  5. Leave “Stackwalk” unticked and choose Next


Alrightie then… we are all set! Click the Start Capture button to start collecting the good stuff! Xperf123 will run the actual WPA command line trace utility (called xperf.exe if you really want to know). Now go to SharePoint central administration and, like we did with our Process Monitor test, start a full crawl. Wait till the crawl finishes and then in Xperf123, click the Stop Capture button. A trace file will be saved in the WPT installation folder (or wherever you specified). The naming convention will be based on the server name and the date the trace was run.
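
For completeness, Xperf123 is really just wrapping a couple of xperf.exe calls, so if you ever need to script this (or Xperf123 is not an option) something along these lines should produce a comparable disk IO trace. Treat it as a sketch: the kernel flags shown are what I understand the Disk I/O trace type to use, so verify them against xperf -help start (or Robert Smith’s article) on your own install.

cd "C:\Program Files\Microsoft Windows Performance Toolkit"

# Start a kernel trace with process, image load and disk IO events
.\xperf.exe -on PROC_THREAD+LOADER+DISK_IO+DISK_IO_INIT

# ... run the full crawl (or whatever workload you are measuring) ...

# Stop the trace and merge it into an .etl file for analysis
.\xperf.exe -d C:\Traces\crawl-diskio.etl

Double clicking the resulting .etl file opens it in Performance Analyser, which is exactly where the next section picks up.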


Okay, so capturing the trace was positively easy. How about analysing it visually?

Double click on the newly minted trace file and it will be loaded into the Performance Analyser analysis tool (this tool is also available from the Start menu). When the tool loads and processes the trace file, you will see CPU and Disk IO counts reported visually. The CPU is the line graph and IO counts are represented in a bar graph. Unlike Windows Performance Monitor, which we covered in Part 7, this tool has a much better ability to drill down into details.

If you look closely below there are two “flyout” arrows. One is on the left side in the middle of the screen and applies to all graphs and the other is on the top-right of each graph. If you click them, you are able to apply filters to what information is displayed. For example: if you click the “IO Counts” flyout, you can filter out which type of IO you want to see. Looking at the screenshot below, you can see that the majority of what was captured were disk writes (the blue bars below).


Okay, so let’s get a closer look at what is going on disk-wise. Right click somewhere on the Disk IO bar graph and choose “Detail Graph” from the menu.


Now we have a very different view. On the left we can see which disk we are looking at and on the right we can see detailed performance stats for that disk. It may not be clear from the screenshot, but the disk IO reported below is broken down by process. This detailed view also has flyouts and dropdowns that allow you to filter what you see. There is an upper-left dropdown menu under the section called “Physical Disk”. This allows you to change which disk you are interested in. On the upper right, there is a flyout labelled “Process Name”. Clicking this allows you to filter the display to only view a subset of the processes that were running at the time the trace was captured.


Now in my case, I only want to see the SQL Database activity, so I will make use of the aforementioned filtering capability. Below I show where I selected the disk where the database files reside and on the right I deselected all processes apart from SQLSERVR.EXE. Neat huh? Now we are looking at the graph of every individual IO operation performed during the time displayed and you can even hover over each dot to get more detail of the IO operation performed.


You can also zoom in with great granularity. Simply select a time period from the display by dragging the mouse pointer over the graph area. Right click the now selected time period and choose “Zoom to Selection”. Cool eh? If your mouse has a wheel button, you can zoom in and out by pressing the Ctrl key and rolling the mouse wheel.


Now even for the most wussy, non-technical BA reading this, surely your inner nerd is liking what you see. But why stop here? After all, Process Monitor gave us lots more loving detail and had the ability to utilise sophisticated filtering. So how does WPA stack up?

To answer this question, try these steps: right click on the detail graph and this time choose “Summary Table”. This allows us to view the IO data in even more detail.


Voila! We now have a list of every IO transaction performed during the sample period. Each line in the summary table represents a single I/O operation. The columns are movable and sortable as well. On that note, some of the more interesting ones for our purposes include (thanks to Robert Smith for the explanation of these):

  • IO Type: Read, Write, or Flush
  • Complete Time: Time of I/O completion in milliseconds, relative to start and stop of the current trace.
  • IO Time: The amount of time in milliseconds the I/O took to complete
  • Disk Service Time: The inferred amount of time (in microseconds) the IO operation has spent on the device (this one has caveats, check Robert Smiths post for detail).
  • QD/I: Queue depth of the disk, irrespective of partitions, at the time this I/O request was initialized
  • IO Size: Size of this I/O, in bytes.
  • Process Name: The name of the process that initiated this I/O.
  • Path: Path and file name, if known, that is the target of this I/O (in plain English, this essentially means the file name).

I have a lot of IO requests in this summary view, so let’s see how this baby can filter. First up, let’s look only at IO that was initiated by SQL Server. Right click on the “Process Name” column and choose “Filter To” –> “Search on Column…” In the resulting window, enter “SQLSERVR.EXE” in the “Find what:” textbox. Double check that the column name is set to “Process name” in the dropdown and click Filter.


You should now see only SQL IO traffic. So let’s drill down further still. This time I want to exclude IO transactions that are transaction log related. To do this, right click on the “Path Name” column and choose “Filter To” –> “Search on Column…” In the resulting window, enter “MDF” in the “Find what:” textbox. Double check that the column name is set to “Path name” in the dropdown and click Filter.


Can you guess the effect? Only SQL Server database files will be displayed since they typically have a file extension of MDF.

In the screenshot below, I have then used the column sorting capability to look at the IO sizes. Neat huh?


Don’t forget Performance Monitor…

Just before we are done with the Windows Performance Analysis Toolkit, cast your mind back to the start of this walkthrough when we used Xperf123 to generate this trace. You might recall there was a tickbox in the Xperf123 wizard called “Enable Perfmon”. Well, it turns out that Xperf123 had one final perk. While the WPA trace was made, a Perfmon trace of broader system performance was captured at the same time. These logs are located in the C:\PerfLogs\ directory and are saved in the native Windows Performance Monitor format. So just double click the file and watch the love…


How’s that for a handy added bonus. It is also worth mentioning that the Perfmon trace captured has a significant number of performance counters in the categories of Memory, PhysicalDisk, Processor and System.
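
If you would rather slice that Perfmon capture with PowerShell than with the Perfmon GUI, the Import-Counter cmdlet (available from PowerShell 2.0 onward) can read the log directly. The path and counter name below are assumptions based on the default Xperf123 output described above; adjust them to match your own capture.

# Read the Perfmon log that Xperf123 captured and report average disk latency per counter instance
$log = Get-ChildItem C:\PerfLogs\*.blg | Select-Object -First 1
$samples = Import-Counter -Path $log.FullName |
    Select-Object -ExpandProperty CounterSamples |
    Where-Object { $_.Path -like "*physicaldisk*avg. disk sec/transfer*" }

$samples | Group-Object Path | ForEach-Object {
    $avgMs = ($_.Group | Measure-Object CookedValue -Average).Average * 1000
    "{0} : average {1:N1} ms per transfer" -f $_.Name, $avgMs
}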

Conclusion and coming next…

Well! That was a long post, but that was more because of verbose screenshots than anything else.

Both Process Monitor and Windows Performance Analyser are very useful tools for developing a better understanding of disk IO patterns. While Procmon has more sophisticated filtering capabilities, WPA trumps Procmon in terms of reduced overhead (apparently 20,000 events per second costs less than 2% CPU on a 2.0 GHz processor). WPA also has the ability to visualise and drill down into the data better than Procmon can.

Nevertheless, both tools have far more utility beyond the basic scenarios outlined in this series and are definitely worth investigating more.

In the next and I suspect final post, I will round off this examination of performance by making a few more general SharePoint performance recommendations and outlining a lightweight methodology that you can use for your own assessments.

Until then, thanks for reading…

Paul Culmsee

www.hereticsguidebooks.com


SQL Server oddities


So it’s Saturday night and, when I should be out having fun, I am instead sitting in a training room in Wellington, New Zealand, configuring a lab for a course I am running on Monday.

Each student lab setup is two virtual machines: the first a fairly stock standard AD domain controller and the second a SQL/SharePoint 2010 box. Someone else set up the machines, and I came in to make some changes for the labs next week. But as soon as I fired up the first student VMs I hit a snag. I loaded SQL Server Management Studio, only to find that I was unable to connect to it as the domain administrator, despite being fairly certain that the account had been granted the sysadmin role in SQL.


Checking the event logs showed an error that I had not seen before: “Token-based server access validation failed with an infrastructure error”. Hmmm.

Log Name: Application

Source: MSSQLSERVER

Date: 31/07/2010 6:45:37 p.m.

Event ID: 18456

Task Category: Logon

Level: Information

Keywords: Classic,Audit Failure

User: TRAINSBYDAVE\administrator

Computer: SP01.trainsbydave.com

Description:

Login failed for user ‘TRAINSBYDAVE\administrator’. Reason: Token-based server access validation failed with an infrastructure error. Check for previous errors.

[CLIENT: <local machine>]

Event Xml:

<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">

<System>

<Provider Name="MSSQLSERVER" />

<EventID Qualifiers="49152">18456</EventID>

<Level>0</Level>

<Task>4</Task>

<Keywords>0x90000000000000</Keywords>

<TimeCreated SystemTime="2010-07-31T06:45:37.000000000Z" />

<EventRecordID>8281</EventRecordID>

<Channel>Application</Channel>

<Computer>SP01.trainsbydave.com</Computer>

<Security UserID="S-1-5-21-3713613819-1395520312-4192346095-500" />

</System>

<EventData>

<Data>TRAINSBYDAVE\administrator</Data>

<Data> Reason: Token-based server access validation failed with an infrastructure error. Check for previous errors.</Data>

<Data> [CLIENT: &lt;local machine&gt;]</Data>

As it happened, whoever set it up had SQL Server in mixed mode authentication, so I was able to sign in as the sa account and have a poke around. All things considered, it should have worked. The user account in question was definitely in the logins and set with sysadmin server rights as shown below.


Uncle google showed a few people with the error but not as many as I expected to see since half the world gets nailed on authentication issues. I also took the event log suggestion and looked for a previous error. A big nope on that suggestion. In fact in all respects, everything looked sweet. The machine was valid on the domain and I was able to perform any other administrative task.

Finally, I removed the TRAINSBYDAVE\administrator account from the list of Logins in SQL Server. It gave the unsurprising whinge about orphaned database users, but luckily for AD accounts when you re-add the same account back it is smart enough to re-establish the link.


As soon as I re-added the account, all was good again. If I was actually interested enough I’d delve into why this happened, but not tonight – I have another 6 student machines to configure 🙂
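
For anyone who hits the same error and would rather script the fix than click through Management Studio, the equivalent T-SQL run via sqlcmd looks roughly like this. The instance and login names are from my lab, sp_addsrvrolemember is the pre-SQL 2012 way of granting sysadmin, and as noted above you should check for orphaned database users afterwards.

# Drop and re-create the Windows login, then put it back in the sysadmin role
$fix = @"
DROP LOGIN [TRAINSBYDAVE\administrator];
CREATE LOGIN [TRAINSBYDAVE\administrator] FROM WINDOWS;
EXEC sp_addsrvrolemember @loginame = N'TRAINSBYDAVE\administrator', @rolename = N'sysadmin';
"@
# Connect with an account that still works (in my case the sa account, since mixed mode was enabled)
& sqlcmd -S "SP01" -U sa -P "<sa password>" -Q $fix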

Goodnight!



SharePoint Webcasts: Reporting Services for the Really Really Good Looking


Last year, Peter Serzo and I presented at the SharePoint Best Practices Conference in DC. We did an extremely serious talk called “SharePoint and SQL Reporting Services 2008 for the really really good looking” which rated rather well. As part of this, we recorded a bunch of screencasts that have never seen the light of day, so I thought that some would benefit from this being released to a wider audience.

Note: This post and content is really going to make utterly no sense unless you have watched Zoolander. Even if you have seen the movie, before you launch into the webcasts, some scene setting is required.

The business need

Some time ago, Peter and I were contracted by the Derek Zoolander School for the Really, Really Good Looking after Derek saw Microsoft’s new SharePoint diagram when he accidentally picked up a “Computerworld” magazine. Apart from matching Derek’s suit colour rather nicely, the diagram captivated his imagination with the notion of “Insights”.

Zoolander thought that “Insight” sounded like the perfect look to follow up from the highly successful “Magnum”, which he used to save the Malaysian prime minister’s life. He took the diagram to his wife, and demanded that he must have “Insights” at all costs.


Zoolander’s wife saw the business problem that “Insights” would help to address. You see, the Derek Zoolander School for the Really, Really Good Looking, at great expense, custom developed an ERP system to manage everything you needed to know about male models. The system was called the “Computerised Records for Attractive People”…


The CRAP system stored all sorts of interesting information about male models, such as tracking their “hotness”, as well as important detail such as stated age versus actual age, and any cosmetic procedures that they have undertaken. After a long and expensive consultation, Peter and I concluded that SharePoint 2007, integrated with SQL Reporting Services, was the perfect solution to create the all important “Insights” that Zoolander so desperately needed.

As a result, we conducted a project kickoff meeting with Hansel and Peter tried to explain the architecture of reporting services using a nice diagram.


… but we worked out pretty quickly that this was not the way to explain how it all worked to poor old Hansel…


So instead, we went the live demo route. Being male models, custom development was totally out of the question. This solution had to be done using all out of the box methods in a quick and easy manner. Below are the four live demos that were recorded and now you can use them as inspiration for your own male modelling school.

  • Our first webcast illustrates how we were able to create a meaningful report from the CRAP system within five minutes.
  • The second webcast expanded on this idea, by illustrating how reports can be parameterised and linked together for drilldown reporting.
  • The third demo modifies the user profile store to allow for recording of each user’s unique ID in the CRAP system
  • The last webcast strings this all together for the final demonstration where we pimp the report to make it dynamic with no custom code.

 


The 5 minute report

Drilling down with Derek

User Profiles for the really really good looking

Pimp my report

 

We hope you find some value from these webcasts and we look forward to hearing about your hot new look as a result!

Thanks for reading

 

 

Paul Culmsee

www.sevensigma.com.au


SBS 2008, Hewlett Packard, WSS3, Search Server 2008 Express and a UPS – Oh the pain!



In the words of Doctor Smith from Lost in Space, while everyone else was in Vegas having a grand old time, I was at a client site, having to come to grips with the beast known as Windows Small Business Server 2008. I rarely work with SBS2003 and had never used SBS2008 until now.

This was one of those engagements that is somewhat similar to those awful dreams where you are trying to get to some place, but you never quite get there and your subconscious puts all sorts of strange and surreal obstacles in your path. In my case, the surreal obstacles were very real, yet some of them were really, really dumb. What’s more, it is a very sad indictment on IT at several levels and a testament to how complexity will never be tamed with yet more complexity.

As a result, I really fear the direction that IT in general is heading.

So where to begin? This project was easy enough in theory. A former colleague called me up because he knew of my dim, dark past in the world of Cisco, Active Directory and SharePoint. He asked me to help put in SBS2008 for him, configuring Exchange/AD/SharePoint and migrating his environment over to it.

“Sure”, I say, “it’ll be a snap” (famous last words)


I haven’t used the coffee or tequila ratings for a while, so I thought that this post was apt for dusting them off. If you check the Why do SharePoint Projects Fail series, you will see that I use tequila shots or coffee at times. In this case, I will use the tequila shots to demonstrate my stress levels.

Attempt 1

We start our sorry tale a few weeks ago, where my client had ordered Small Business Server 2008 and the media/key had not arrived by the time I was due to start. The supplier came to the rescue by sending them a copy of the media and promised to send the license key in a couple of days.

The server was a HP Proliant DL360 G6, a seemingly nice box with some good features at a reasonable price. HP/Compaq people will be familiar with the SmartStart software and process, where instead of using the windows media, you pop in the supplied SmartStart CD and it will perform some admin tasks, before asking for the windows media, auto-magically slipstreaming drivers and semi-automating the install.

On client machines, I never use the CD from the vendor because there is always too much bloatware crap. However on servers I generally do use the CD, because it tends to come with all the tools necessary to manage disk storage, firmware and the like. I dutifully popped in the SmartStart CD, answered a few basic questions, and it asked for the Windows media CD.

Cleverworkarounds stress rating: Good so far

Next it asked me for the Windows SBS2008 licence key. Of course, I was using media that had been lent to us from the supplier because my client’s media (and keys) had not arrived. Thus, since I did not have a license key I was unable to proceed with the install using this preferred manner. HP, in their infinite wisdom, have assumed that you always have the license key when you install via their SmartStart CD, despite Microsoft giving you 30 days to activate the product. To be fair on HP, they are hooking into Microsoft’s unattended installation framework, so perhaps the blame should be shared.

All was not lost however. The SmartStart CD can be run after Windows has been installed. It then will install all the necessary “HP bits” like graphics and system board drivers. So I booted off the Windows CD and fortunately, the Windows installer detected the HP storage controller and the disk array, and proceeded to let me partition it and install.

Cleverworkarounds stress rating: Minor annoyance, but good so far

Small Business Server 2008 did its thing and then loaded up a post install wizard that sets the timezone, active directory domain name and the like. At a certain step when running this wizard, SBS2008 informed me that there was no network card with a driver loaded, so it could not continue. As it turns out, Microsoft’s initial SBS2008 configuration wizard simply will not proceed unless it finds a valid network card. But since this is the pre-install wizard, we are not yet at the point in the installation where we have a proper windows desktop with start menu and windows explorer. All is not lost (apparently), because SBS allows you to start device manager from within the wizard and search for the driver.

Fair enough, I think to myself, so I pop in the HP SmartStart CD and tell device manager to search the media.

Cleverworkarounds stress rating: Spider senses tingle that today might not turn out well


Windows device manager comes back and tells me that it cannot find any drivers.

Why?

Well after some examination, the SmartStart drivers are all self extracting executables and therefore device manager could not find them when I told it to. Of course, the self extracting zip files have hugely meaningful names like C453453.EXE making it really obvious to work out what driver set is the one required… not!

Luckily, Ctrl+Alt+Del gave me task manager which allowed me to start a Windows Explorer session, and I was able to browse to the CD and run the autorun of the SmartStart CD manually. This loaded up HP’s fancy schmancy driver install software that produces a nice friendly report on what system software is missing and proceeds to install it all for you.

SmartStart did its thing, finding all of the driverless hardware and installed the various drivers. A few minutes and a reboot later and SBS2008 reruns its configuration wizard and this time finds the network and allows me to complete the wizard. This triggers another thirty minutes of configuration and another reboot and we have ourselves a small business server!

Cleverworkarounds stress rating: Spider senses subside – back on track?

Next I did something that is a habit that has served me well over the years (until now). I reran the Smartstart CD, now with a network and internet access. This time I told the driver management utility to connect to HP.COM. It scanned HP and reported to me that most drivers on the SmartStart CD were out of date. This is unsurprising because most of the time I do server builds for any vendor, I find that about half of the drivers, BIOS and various firmwares have been replaced by newer versions since the CD was pressed.

Since this is a brand new server build, it is a habit of mine to upgrade to the latest drivers, BIOS and firmware before going any further.

Among the things found to be out of date were the BIOS, the firmware on the RAID storage controller and the network card. The SmartStart software downloaded all of these updates, and another reboot later, all were installed happily. Another hour of patching via Windows Update, and we have a ready to go SBS2008 server with WSS3, Exchange, SQL Express and WSUS all configured automatically for you.

Cleverworkarounds stress rating: This SBS2008 stuff isn’t so bad right?

Okay, so things were good so far, but now here is where the fun really begins.

Windows 2008 SBS comes with a pre-installed WSS3 site called http://Companyweb. As we all know, search completely sucks in WSS3. It has a bunch of limitations and isn’t a patch on what you get with MOSS. But no problem – we have Microsoft Search Server Express now, a free upgrade which turns WSS search from complete horribleness to niceness fairly quickly.

For those of you reading this who run WSS3 and have not installed Search Server Express, I suggest you investigate it as it does offer a significant upgrade of functionality. Search Server Express pretty much gives WSS3 the same search capabilities as MOSS 2007.

So, I proceeded to install Search Server 2008 express onto this Small Business Server 2008 box. I have installed Search Server Express quite a few times before and I have to admit, it is a tricky install at times. But given that this was a fresh Small Business Server 2008 install and not in production, as well as having successfully installed it on Small Business Server 2003 previously, I felt that I should be safe.

I commenced the Search Server 2008 express install, and the first warning sign that my day was about to turn bad showed itself. The install of search server express only allowed me to choose the “Basic” option. The option that I wanted to use, “Advanced” was greyed out and therefore unavailable.

Cleverworkarounds stress rating: Spider senses tingling again


Knowing this server was not in production, I went ahead and allowed Search Server Express to install as per the forced basic setting. The install itself appeared to work, but it died during the SharePoint configuration wizard. It specifically crapped out on step 9, with the error message

“Failed to create sample data. An exception of type Microsoft.SharePoint.SPException was thrown.  Additional exception information: User cannot be found”.

“Curses!” I say, “another trip to the logs folder in the 12 hive”. For the nerds, the log is pasted below.

[SPManager] [INFO] [10/21/2009 3:38:47 PM]: Finished upgrading SPContentDatabase Name=ShareWebDb Parent=SPDatabaseServiceInstance Name=Microsoft##SSEE.
[SPManager] [DEBUG] [10/21/2009 3:38:47 PM]: Using cached [SPContentDatabase Name=ShareWebDb Parent=SPDatabaseServiceInstance Name=Microsoft##SSEE] NeedsUpgrade value: False.
[SharedResourceProviderSequence] [DEBUG] [10/21/2009 3:38:47 PM]: Unable to locate SearchDatabase. Exception thrown was: System.Data.SqlClient.SqlException: Cannot open database "SharedServices_DB_ed3872ca-06b1-44c5-8ede-5a81b52265f9" requested by the login. The login failed.
Login failed for user ‘NT AUTHORITY\NETWORK SERVICE’.
   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj)
   at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
   at System.Data.SqlClient.SqlInternalConnectionTds.CompleteLogin(Boolean enlistOK)

The long and short of this error takes a little while to explain. First, we need to cover the historical difference between SQL Server Express Edition and SQL Server Embedded Edition (also known as the Windows Internal Database). From Wikipedia:

SQL Server 2005 Embedded Edition (SSEE): SQL Server 2005 Embedded Edition is a specially configured named instance of the SQL Server Express database engine which can be accessed only by certain Windows Services.

SQL Server Express Edition: SQL Server Express Edition is a scaled down, free edition of SQL Server, which includes the core database engine. While there are no limitations on the number of databases or users supported, it is limited to using one processor, 1 GB memory and 4 GB database files.

Why does this matter? Well Microsoft, being the wise chaps that they are, decided that when you perform a SharePoint installation using the “basic” option, different editions of SharePoint use different editions of SQL Server! Mark Walsh explains it here:

  • When you use the "Basic" install option during MOSS 2007 installation it will install and use SQL Server 2005 Express Edition and you have a 4GB database size limit.
  • When you use the "Basic" install option during WSS 3.0 installation it DOES NOT use SQL Express, it uses SQL Server 2005 Embedded edition and it DOES NOT have a 4GB size limit.

As it happens, Small Business Server 2008 comes with WSS3 preinstalled. Annoyingly, but unsurprisingly, the Small Business Server team opted to use the Basic installation mode, so, as described above, SQL Server Embedded Edition (known on Windows Server 2008 as the Windows Internal Database) is used. For reference, WSUS on Small Business Server 2008 also uses this database instance.

BUT BUT BUT…

Search Server 2008 Express uses SQL Server Express Edition when performing a Basic install. As a result, an additional SQL Server Express instance (SERVERNAME\OFFICESERVERS) gets installed onto the Small Business Server 2008 box. To make matters worse, the installer gets mixed up: it installs some Search Server Express databases (such as the Shared Services Provider database) into the new instance, but uses the SQL Embedded Edition instance for other databases (like the search database). Later, during the configuration wizard, it cannot find the databases it needs because it is looking in the wrong instance!
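If you want to see the carnage for yourself, a quick way is to list the databases in each instance and compare. The snippet below is a sketch only: it assumes sqlcmd.exe is available on the box (it ships with the SQL Server client tools) and that the instance names match the defaults described above.

# List the databases in the Windows Internal Database (SSEE) instance used by WSS3 and WSUS
sqlcmd -E -S '\\.\pipe\MSSQL$MICROSOFT##SSEE\sql\query' -Q "SELECT name FROM sys.databases ORDER BY name"

# List the databases in the SQL Express instance created by the Search Server Express basic install
sqlcmd -E -S "$env:COMPUTERNAME\OFFICESERVERS" -Q "SELECT name FROM sys.databases ORDER BY name"

Seeing the two lists side by side makes the split (and the mess) fairly obvious.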

The net result is the error shown in the log above. I tried all sorts of things, like copying the Express databases into the Embedded Edition instance, but I couldn’t disentangle the dependency issue: some parts of SharePoint (the Search Server Express bits, no doubt) looked in the SQL Express instance while the WSS bits looked in the Embedded instance. Eventually, conscious of time, I proceeded to uninstall Search Server Express.

Cleverworkarounds stress rating: Some swear words now uttered


I attempted to uninstall Search Server Express, and the uninstaller told me it had completed successfully and wanted a reboot. Unfortunately, SharePoint was now even more hosed than before, and I tried a few things to get back on track (running psconfig to create a new farm and the like). After more frustration, and still conscious of time, I decided to uninstall WSS3 altogether and reinstall it according to the SBS repair guide for WSS3.
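For the record, the sort of psconfig incantation I am talking about is below. This is a sketch only – the server, database and account names are placeholders, it assumes the default 12 hive location, and on SBS 2008 it most certainly did not rescue me.

# Attempt to provision a brand new farm / configuration database for WSS3
$bin = "$env:CommonProgramFiles\Microsoft Shared\Web Server Extensions\12\BIN"
& "$bin\psconfig.exe" -cmd configdb -create `
    -server "SERVERNAME\INSTANCE" `
    -database "SharePoint_Config" `
    -user "DOMAIN\spfarm" -password "P@ssw0rd" `
    -admincontentdatabase "SharePoint_AdminContent"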

The WSS3 uninstall and reinstall had the effect of stuffing up WSUS as well (I assume because it shares the same Windows Internal Database instance), and after a couple of hours of trying increasingly hacky ways to get it all working, I was forced to give up.

Note: whatever you do, do not attempt this method. At one point I tried to trick WSS3 into temporarily thinking it was not a Basic mode install, so that Search Server Express would offer the Advanced option, but it only made things worse because the configuration database got confused.

Cleverworkarounds stress rating: Installing SharePoint in basic mode is committing a crime against humanity.


Attempt 2


At this point I could hear Doctor Smith abusing me like the poor old robot: “You bubble-headed booby, you ludicrous lump, you addlepated amateur, you doddering dunderhead”.

Since WSUS, as well as the Windows Internal Database, had been completely screwed by my relentless uninstallation and repair attempts, I started to get nervous. Small Business Server 2008 is a very fussy beast that can get very upset at seemingly benign changes. I felt I had messed about so much that I could no longer guarantee the integrity of the server, so I wiped it clean and reinstalled.

I installed Windows as per the previous method, and once again the install wizard stopped, asking me for a network driver. Once again I popped in the HP SmartStart CD and proceeded to run the driver install program from the SBS configuration wizard.

This time, no network card was detected!

What the? During attempt 1, I happily installed the necessary HP drivers using the same %$^#% SmartStart CD! Why is it not detecting now??

Cleverworkarounds stress rating: Now thinking about how much fun everyone is having in Vegas while I am fighting this server


After some teeth gnashing, the cause of this problem hit me as I was driving home from the site. I had upgraded the firmware of the network card during attempt 1, as well as the storage controller and system BIOS. I realised that the stupid, brain-dead HP network card driver on the SmartStart CD likely could no longer recognise its own card now that it was running newer firmware.

The next day I came back refreshed and found that there were indeed newer network drivers on the HP site. I downloaded and extracted them and, sure enough, the network card was suddenly detected and I was back in business. How dumb is that? Surely if you are going to write a driver, at the very least make it recognise the hardware irrespective of the firmware revision!

Once I got past this stupid annoyance, it did not take too long for SBS 2008 to be installed and ready to go. Remember that WSS3 is installed for you in Basic mode, so changing it requires uninstalling WSS and reinstalling it in a different mode. The problem is SBS 2008 and its fussiness about anyone messing with its configuration: go down this path and you risk future service packs and updates breaking because things are not as they expect. Additionally, you would have to create companyweb manually, which raises the risk of a misconfiguration or mistake along the way.

I logged a call with Microsoft and got a pretty good engineer (hey Ai Wa), who was able to consistently reproduce all of my issues in the lab but was unable to work out a supportable fix. In the meantime, I tried to force Search Server 2008 Express into an Advanced mode install using a scripted install, based on an article written by my old mate Ben Curry. Alas, I could not bend it to my will and, at the time of writing, I have had to give up on Search Server 2008 Express with SBS 2008 for now.
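For the curious, the scripted approach boils down to feeding setup.exe an answer file (config.xml) that asks for a server farm style install instead of Basic. The sketch below is illustrative only: it assumes the Search Server Express setup files have been extracted to C:\SSXInstall, and the setting IDs follow the standard 2007-era SharePoint setup schema, so check them against Ben Curry’s article before relying on them. On my SBS 2008 box it still refused to cooperate.

# Write a config.xml that requests a server farm ("advanced") style install
$config = @'
<Configuration>
  <Package Id="sts">
    <Setting Id="SETUPTYPE" Value="CLEAN_INSTALL"/>
  </Package>
  <Logging Type="verbose" Path="%temp%" Template="Search Server Express Setup(*).log"/>
  <Setting Id="SERVERROLE" Value="APPLICATION"/>
  <Setting Id="USINGUIINSTALLMODE" Value="0"/>
  <Setting Id="SETUP_REBOOT" Value="Never"/>
</Configuration>
'@
Set-Content -Path 'C:\SSXInstall\config.xml' -Value $config

# Kick off setup using the answer file instead of the GUI
& 'C:\SSXInstall\setup.exe' /config 'C:\SSXInstall\config.xml'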

For what it’s worth, I know I can make this work by installing WSS3 differently. I had planned to properly nail this issue in my lab using the out-of-the-box installs before publishing this article, but I didn’t have sufficient hardware to run Small Business Server 2008 – it requires a lot of grunt! So I will revisit this issue once my new VM server arrives and post an update.

But there is more…

If the whole SmartStart saga, the network driver, SBS 2008 with its dodgy scripted WSS3 install and Search Server Express with its dumb installation assumptions were not enough, I hit more dumb things that showed just how ill-equipped HP’s support is to deal with these sorts of issues.

The server, not surprisingly, was supplied with a nice hardware RAID configuration, and my client opted to buy some additional disks. When the disks arrived and were installed, we found that the RAID controller could see them, but we were unable to add them to the existing array using HP’s management tools. HP’s “friendly” support was unable to work out the issue, asked me to do things that were never going to work and insulted my intelligence. Eventually I worked out what was going on myself, via HP’s own forums: it turns out that the HP server requires a write cache module to be able to grow an array. We had one of those installed, but on opening up the chassis we found we were missing the battery that goes with the write cache module. HP were unable to determine the part number we needed, and we ended up working it out ourselves and telling our supplier.

Then HP stuffed up the order: after two weeks of following up, it turned out they had forgotten to put the order through to Singapore to get the part. It seemed that once we went outside HP’s normal supply chain system, it all broke down. After two weeks and numerous calls, they suddenly realised the part had been available in Sydney all along and shipped it over the next day. The irony was that the next day the part from Singapore arrived too!

So now we had two batteries.

Grrrr. It shouldn’t be this hard! Why supply a write cache module and not supply the battery! Dumb Dumb Dumb!

Cleverworkarounds stress rating: Really hating HP and Microsoft at this point.


… and the pièce de résistance!

Okay, so we had a few frustrating struggles, but we more or less got there. Here, though, is the absolute showstopper – the final issue that really made me question this IT discipline I have worked in for twenty years now.

My client rang me on Monday morning to tell me the server was powered off when they arrived on site. This was unusual, because the rack was UPS protected and no other devices were off. We ran HP’s diagnostic tools and no faults were reported. The UPS was a fairly recent APC model, so we connected a serial cable to the server and loaded the APC UPS management software. The software (which looks scarily like a 16-bit app from the early 90s) found the UPS and showed that all was hunky dory.

We decided to perform a battery test using the software. No sooner had I clicked OK than the server powered off with no warning or shutdown. Whoa!!

I put in a call to APC and was told that the UPS was not compatible with HP’s fancy new green star rated power supplies. We had to buy a new UPS because of some “sine wave” mumbo jumbo (UPS engineer talk that I really wasn’t interested in). Whenever the UPS switched to battery, the server would think the power was dodgy and do a pre-emptive shutdown. The reason the server was powered off on Monday morning was that the UPS has a built-in self test that runs every seven days and cannot be disabled!

Cleverworkarounds stress rating: For %%$$ sake!.


Conclusion

Now I don’t know about you, but when a UPS cannot be “compatible” with a server, that’s it – things have gone too far. For crying out loud, a UPS supplies power. How $%#^ hard can that be? How is an organisation supposed to get value out of its IT investments with this sort of crap to deal with?

Then add the complexity of blade servers, Citrix, virtualisation and shared storage, and I truly feel that some sites are sitting on time bombs. The supposed benefits in efficiency, resiliency and scalability that these technologies bring come at an often intangible and insidious cost – sheer risk from incredible complexity. If you look at this case study of a small organisation putting in a basic server, most of the issues I encountered were side effects of that complexity and of the vendors’ inability to help with it.

As the global financial crisis has aptly demonstrated, when things are so complex that no one person can understand everything, and something bad does happen, it tends to happen in a spectacularly painful and costly way.

Finally, before you reply to this likely immature rant and tell me I am a whiner, remember this, all you Vegas people: you got to have fun and marvel at all the new (complex) SP2010 toys, while I sat on the other side of the planet in a small computer room, all bitter and twisted, spouting obscenities at HP, Microsoft and APC while dealing with this crap. When you put that into perspective, I think this article is quite balanced! 🙂

Thanks for reading

Paul Culmsee

www.sevensigma.com.au


Complexity bites: When SharePoint = Risk


I think as you age, you become more and more like your parents. Not so long ago I went out paintballing with some friends and we all agreed that the 16-18 year olds who also happened to be there were all obnoxious little twerps who needed a good kick in the rear. At the same time, we also agreed that we were just as obnoxious back when we were that age. Your perspective changes as you learn and your experience grows, but you don’t forget where you came from.

I now find myself saying stuff to my kids that my parents said to me, I think today’s music is crap, and I have taken a liking to drinking quality scotch. Essentially, all I need now to complete the metamorphosis into my father is for all my hair to fall out!

So when I write an article whining that IT has a credibility issue and has gone backwards in its ability to cope with various challenges, I fear that I have now officially become my parents. I’ll sound like the grandpa who always tells you that life was so much simpler back in the 1940s.

Consequences of complexity…

Before I go and dump on IT as a discipline, how about we dump on finance as a discipline, just so you can be assured that my cynicism extends far beyond nerds.

I previously wrote about how Sarbanes-Oxley legislation was designed to, yet ultimately failed to, provide assurance to investors and regulators that public companies had adequate controls over their financial risk. As I write this, we are in the midst of a once-in-a-generation-or-two credit crisis, where some seven hundred billion dollars ($700,000,000,000) of US taxpayers’ money will be used to take ownership of crap assets (foreclosed or unpaid mortgages).

Part of the problem behind the credit crisis was the use of "collateralized debt obligations" (CDOs). A CDO is a fancy, yet complex, way of taking a bunch of mortgages and turning them into an "asset" that someone else with spare cash can invest in. If you are wondering why the hell anyone would invest in such a thing, consider the people with home loans, supposedly happily paying interest on those mortgages. It is that interest that finds its way to the holder (investor) of the CDO. So a CDO is, supposedly, an income stream.

Now if that explanation makes your eyes glaze over, I have bad news for you: that’s supposed to be the easy part. The reality is that CDOs are extremely complex things. They can be backed by residential property, commercial property, something called mortgage-backed securities, corporate loans – essentially, anything that someone is paying interest on can find its way into a CDO that someone else buys into to get the income stream from the interest paid.

To provide "assurance" that these CDOs are "safe", ratings agencies give them a grade that investors rely upon when making their investment. So a "AAA" CDO is supposed to have been given the tick of approval by experts in debt-instrument style finance.

Here’s the rub about rating agencies. Below is a news article from earlier in the year with some great quotes:

http://www.nytimes.com/2008/03/23/business/23how.html?pagewanted=print

Credit rating agencies, paid by banks to grade some of the new products, slapped high ratings on many of them, despite having only a loose familiarity with the quality of the assets behind these instruments.

Even the people running Wall Street firms didn’t really understand what they were buying and selling, says Byron Wien, a 40-year veteran of the stock market who is now the chief investment strategist of Pequot Capital, a hedge fund. “These are ordinary folks who know a spreadsheet, but they are not steeped in the sophistication of these kind of models,” Mr. Wien says. “You put a lot of equations in front of them with little Greek letters on their sides, and they won’t know what they’re looking at.”

Mr. Blinder, the former Fed vice chairman, holds a doctorate in economics from M.I.T. but says he has only a “modest understanding” of complex derivatives. “I know the basic understanding of how they work,” he said, “but if you presented me with one and asked me to put a market value on it, I’d be guessing.”

What do we see here? How many people really *understand* what’s going on underneath the complexity?

Of course, we now know that many of the mortgages backing these CDOs were made to people with poor credit histories or a high risk of being unable to pay the loans back. Jack up the interest rate or the cost of living and people get foreclosed on or stop paying the mortgage. When that happens en masse, we have a glut of houses for sale, forcing down prices, lowering the value of the underlying assets and eliminating the "income stream" that CDO investors relied upon, making the CDOs pretty much worthless.

My point is that the complexity of the CDOs was such that even a guy with a doctorate in economics had only a ‘modest understanding’ of them. Holy crap! If he doesn’t understand them, then who the hell does?

Thus, the current financial crisis is a great case study in the relationship between complexity and risk.

Consequences of complexity (IT version)…

One thing about doing what I do is that you spend a lot of time on-site. You get to see IT infrastructure and development at many levels. More importantly, you also spend a lot of time talking to IT staff and organisational stakeholders with a very wide range of skills and experience. Finally, and most important of all, you get to see organisational maturity at work firsthand.

My conclusion? IT is completely f$%#ed up across all disciplines, and many organisations will have their own mini equivalent of the US$700 billion haemorrhage. Not only that, it is far worse today than it used to be – and getting worse! IT staff are struggling with ever-accelerating complexity, and the "disconnect" between IT and the business is growing too. To many businesses, the IT department has a credibility problem; to IT, the feeling is completely mutual 🙂

You can find a nice thread about this topic on Slashdot. My personal favourite quote from that thread is this one:

Let me just say, after 26 years in this business, of hearing this every year, the systems just keep getting more complex and harder to maintain, rather than less and easier.

Windows NT was supposed to make it so anyone who could use Windows could manage a server.

How many MILLION MSCEs do we have in the world now?

Storage systems with Petabytes of data are complex things. Cloud computing is a complex thing. Supercomputing clusters are complex things. World-spanning networks are complex things.

No offense intended, but the only people who think things are getting easier are people who don’t know how they work in the first place

Also there is this…

There are more software tools, programming languages, databases, report writers, operating systems, networking protocols, etc than ever before. And all these tools have a lot more features than they used to. It’s getting increasingly harder to know "some" of them well. Gone are the days when just knowing DOS, UNIX, MVS, VMS, and OS/400 would basically give you knowledge of 90% of the hardware running. Or knowing just Assembly/C/Cobol/C++ would allow you to read and maintain most of the source code being used. So I would argue that the need for IT staff is going to continue to increase.

I think the "disconnect" between IT and Business has a lot more to do with the fact that business "knows" they depend on IT, but they are frustrated that IT can’t seem to deliver what they want when they want it. On the other side, IT has to deal with more and more tools and IT staff has to learn more and more skills. And to increase frustration in IT, business users frequently don’t deliver clear requirements or they "change" their mind in the middle of projects….

So it seems that I am not alone 🙂

I mentioned previously that more often than not, SQL Server is poorly maintained – I see it all the time. Yet today I was speaking to a colleague who is a storage (SAN) and VMware virtualisation god. I asked him what the average VMware setup was like and his answer was similar to my SQL Server and SharePoint experience. In his experience, most of them were sub-optimally configured, poorly maintained, poorly documented and he could not provide any assurance as to the stability of the platform.

These sorts of quality assurance issues are rampant in application development too, and I most definitely see the same thing in the security realm.

As the above quote states, "it’s getting increasingly harder to know *some* of them well". These days I am working with specialists who live and breathe their particular discipline (such as storage, virtualisation, security or comms). Over time those disciplines grow more complex and sub-disciplines appear.

Pity, then, the poor developer/sysadmin/IT manager who is trying to keep a handle on all of this while providing a decent service to their organisation!

Okay, so what? IT has always been complex – I sound like a Gartner cliché. What’s this got to do with SharePoint?

Consequences of SharePoint complexity…

SharePoint, for a number of reasons, is one of those products that has a way of laying bare any gaps in an organisation’s overall maturity around technology and strategy.

Why?

Because it is so freakin’ complex! That complexity transcends IT disciplines and goes right to the heart of organisational and people issues as well.

It’s bad enough getting nerds to agree on something, let alone organisation-wide stakeholders!

Put simply, if you do a half-arsed job of putting SharePoint in, you will be punished in so many ways! The odds are against you before you even start, because it takes only one mistake in any of the complex layers of hardware, systems, training, methodology, information architecture and governance to devalue the whole project.

When I first started out, I was helping organisations get SharePoint installed. Lately, however, I am visiting a lot of sites where SharePoint has already been installed but has not been a success. There are various reasons; I have cited them in detail in the project failure series, so I won’t rehash them all here (I’d suggest reading parts three, four and five in particular).

I have firmly concluded that much of SharePoint is more art than science and, what’s more, the organisation has to be ready to come with you. Due to differing learning styles and poor communication of strategy, this is actually pretty rare. Unfortunately, IT are not the people best suited to "getting the organisation ready for SharePoint".

If that wasn’t enough, there is this question: if IT already struggle to manage the underlying infrastructure and systems that underpin SharePoint, how can you have any assurance that IT will have a "governance epiphany" and start doing things the right way?

This translates to risk, people! I will be writing all about risk, in a similar style to the CFO Return on Investment series, very soon. I am very interested in methods for quantifying the risk brought about by the complexity of SharePoint and the IT services it relies on. I see a massive parallel with the complexity factor in the current financial crisis, and I think a lot can be learned from it. SOX was supposed to provide assurance and yet did nothing to prevent the current crisis; it is therefore a great example of mis-focused governance, where a lot of effort can be put in for no tangible gain.

A quick test of "assurance"…

Governance is like learning to play the guitar: it takes practice, it does not give up its secrets easily and, despite good intent, you will be crap at it for a while. It is easy to talk about, but putting it into practice is another thing entirely.

Just remember this: the whole point of the exercise is to provide *assurance* to stakeholders. When you set any rule, policy, procedure, standard (or similar), just ask yourself: does this give me the assurance I need to confidently vouch for the service I am providing? Just because you are adopting ITIL principles does *not* mean you are necessarily providing the right sort of assurance.

I’ll leave you with a somewhat biased, yet relatively easy litmus test that you can use to test your current level of assurance.

It might be simplistic, but if you are currently scared to apply a service pack to SharePoint, then you might have an assurance issue. 🙂

 

Thanks for reading

 

Paul Culmsee

www.sevensigma.com.au


Sometimes "Microsoft bashing" is justified


Microsoft bashing is a favourite pastime of many a nerd. Whether it is justified in many cases is debatable, since M$ will never please everyone. But the point is, it is cathartic and, in actual fact, good therapy, because venting your frustrations at Bill Gates is much healthier than venting them at your colleagues or family.

To my Microsoft employee friends reading this: don’t feel all defensive – some of the very best Microsoft bashing I have ever heard comes from you guys anyway 🙂

So although the M$ bashing is sometimes completely unjustified, long may it continue to preserve the sanity of IT professionals around the globe.

Having said that, on occasion you will hit some Microsoft-induced pain that is legitimately and frustratingly dumb. By "legitimately", I mean that you cannot say "although in hindsight it was dumb, I can actually understand why they decided to do that". Instead, you get caught out and experience pain and frustration simply because of a silly Microsoft oversight.

In this case, the oversight is with the SharePoint Configuration Wizard.



SQL God? No… I just know how to do a maintenance plan



I am working on an essay about IT complexity at the moment, and one thing that sprang to mind while thinking about it is the fact that many of my clients seem to think I am some sort of SQL Server guru.

There are two sad realities implied by this.

Firstly, I am far from a SQL Server god. Yes, I have experience with it, but the only reason people think I’m any good at it stems from their general lack of knowledge about the product. Often all I have to do is waltz in with my Michael Bublé-like smooth charm and recommend that a maintenance plan be set up, and I am instantly the guy to talk to in all things SQL.

The truth of the matter is that I’m not fit to lick the boots of a skilled DBA. But like all former system administrators and infrastructure managers who have had to deal with the pressure and consequences of downtime, irrespective of the product, I developed a reflex to learn whatever I need to cover my butt. Often I couldn’t care less how a product operated from the end-user perspective; all I cared about was how it hung together so I could recover it when it inevitably failed in some way.



Good advice hidden in the Infrastructure Update


I guess the entire SharePoint world is now aware of the post-SP1 "infrastructure updates" put out by Microsoft recently. Probably the best thing about them is that the flaky "content deployment" feature has had some serious work done on it (my advice has always been to use it with extreme caution or avoid it, but now I will have to reassess).

Anyway, that is not what I am writing about. Being a former IT security consultant, I have always installed SharePoint in a "least privilege" configuration, using different service accounts for search, the farm, the SSP, Reporting Services and web applications. The farm account in particular needs to be very carefully managed, given its control over the heart of SharePoint – the configuration database.

Specifically, the farm account should never have any significant rights on any SharePoint farm server over and above what it minimally needs (some SQL Server rights and some DCOM permissions). For what it’s worth, I do not use the Network Service account either, as it is actually riskier than a low-privileged domain or local user account.

Developers, on the other hand, tend to run their boxes as full admin. I have no problem with this, so long as there is a QA or test server running least privilege.

However, as always when tightening the screws, there are some potential side effects. There are occasions when the farm account needs to perform a task requiring higher privileges than it has by default. Upgrading SharePoint from a Standard to an Enterprise licence is one such example, and it seems that performing the infrastructure update may be another.

If you take the time to read the installation instructions for the infrastructure update, there is this gem of advice:

To ensure that you have the correct permissions to install the software update and run the SharePoint Products and Technologies Configuration Wizard, we recommend that you add the account for the SharePoint Central Administration v3 application pool identity to the Administrators group on each of the local Web servers and application servers and then log on by using that account

You may wonder why this even matters. After all, it is exceedingly likely that you logged into the server as an administrator or domain administrator anyway. Surely you have all the permissions you need, right?

The answer is that not all update tasks run under your logged-in account. Although you start the installation with your account, at a certain point during the install many tasks are performed by the SharePoint Central Administration application pool identity – the farm account. Thus, even though you have administrative permissions, SharePoint (running as the farm account) does not.

So the advice above essentially says: temporarily add the SharePoint farm account to the local Administrators group, log into the server as that account and then perform the action as instructed. That way the account that starts the installation is the same account that runs SharePoint, and we know we are using the correct account with the correct server privileges.

Once the installation is complete, you can log out of the farm account and revoke its administrator access, and we are back to the original locked-down configuration.
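If you prefer the short version as commands, here is a minimal sketch, assuming a farm account of DOMAIN\spfarm (substitute your own). Run it on each web and application server in the farm.

# Temporarily grant the farm account local admin rights
net localgroup Administrators "DOMAIN\spfarm" /add

# ...log on as DOMAIN\spfarm, install the update and run the SharePoint Products
# and Technologies Configuration Wizard, then...

# Revoke the temporary admin rights once the update is done
net localgroup Administrators "DOMAIN\spfarm" /delete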

So keep this in mind when performing farm tasks that are likely to require privileges at the operating system level. It is a good habit to get into (provided you remember to lock the permissions down again afterwards 🙂).

Cheers

Paul
