
Demystifying SharePoint Performance Management Part 11 – Tales from the Microsoft labs

Hi all and welcome to the final article in my series on SharePoint performance management – for now anyway. Once SharePoint 2013 goes RTM, I might revisit this topic if it makes sense to, but some other blogging topics have caught my attention.

To recap the entire journey, the scene was set in part 1 with the distinction of lead and lag indicators. In part 2, we then examined Requests per Second (RPS) and looked at its strengths and weaknesses as a performance metric. From there, we spent part 3 looking at how to leverage RPS via the Log Parser utility and a little PowerShell goodness. Part 4 rounded off our examination of RPS by delving deeper into utilising log parser to eke out interesting RPS related performance metrics. We also covered the very excellent SharePoint Flavored Weblog Reader utility, which saves a bunch of work and can give some terrific insights. Part 5 switched tack into the wonderful world of latency, and in particular, focused on disk latency. Part 6 then introduced the disk performance metrics of IOPS, MBPS and their relationship to latency. We also looked at typical SharePoint and SQL Server disk IO characteristics and then examined the pros and cons of RPS, IOPS, Latency, MBPS and how they all relate to each other. In part 7 and continuing into part 8, we introduced the performance monitor counters that give us insight into these metrics, as well as introducing the SQLIO utility to stress test disk infrastructure. This set the scene for part 9, where we took a critical look at Microsoft’s own real-world findings to help us understand what suitable figures would be. The last post then introduced a couple of other excellent tools, namely Process Monitor and the Windows Performance Analysis Toolkit, that should be in your arsenal for SharePoint performance.

In this final article, we will tie up a few loose ends.

Insights from Microsoft labs testing…

In part 9 of this series, I examined Microsoft’s performance figures reported from their production SharePoint 2010 deployments. This information comes from the oft mentioned SharePoint 2010 Capacity Planning Guide. Microsoft are a large company and they have four different SharePoint farms for different collaborative scenarios. To recap, those scenarios were:

  1. Enterprise Intranet environment (also described as published intranet). In this scenario, employees view content like news, technical articles, employee profiles, documentation, and training resources. It is also the place where all search queries are performed for all of the other SharePoint environments within the company. Performance reported for this environment was 33580 unique users per day, with an average of 172 concurrent users and a peak concurrency of 376 users.
  2. Enterprise intranet collaboration environment (also described as intranet collaboration). This is where important team sites and publishing portals are housed. Sites created in this environment are used as communication portals, applications for business solutions, and general collaboration. Performance reported for this environment was double the first environment with 69702 unique users per day. Concurrency was more than double, with an average of 420 concurrent users and a peak concurrency of 1433 users.
  3. Departmental collaboration environment. In this scenario, employees use this environment to track projects, collaborate on documents, and share information within their department. Performance reported for this environment was a much lower figure of 9186 unique users per day (which makes sense given it is departmental stuff). Nevertheless, concurrency was similar to the enterprise intranet scenario with an average of 189 concurrent users and a peak concurrency of 322 users.
  4. Social collaboration environment. This is Microsoft’s My Sites scenario, connecting employees with one another and presenting personal information such as areas of expertise, past projects, and colleagues to the wider organization. This includes personal sites and documents for collaboration. Performance reported for this environment was 69814 unique users per day, with an average of 639 concurrent users and a peak concurrency of 1186 users.

Presented as a table, we have the following rankings:

| Scenario | Social Collaboration | Enterprise Intranet Collaboration | Enterprise Intranet | Departmental Collaboration |
|---|---|---|---|---|
| Unique Users | 69814 | 69702 | 33580 | 9186 |
| Avg Concurrent | 639 | 420 | 172 | 189 |
| Peak Concurrent | 1186 | 1433 | 376 | 322 |

When you think about it, the performance information reported for these scenarios is lag indicator based. That is, they are real-world performance statistics based on a pre-existing deployment. Thus while we can utilise the above figures for some insights into estimating the performance needs of our own SharePoint environments, they lack important detail. For example: in each scenario above, while the SharePoint farm topology was specified, we have no visibility into how these environments were scaled out to meet performance needs.

Some lead indicator perspective…

Luckily for us, Microsoft did more than simply report on the performance of the above four collaboration scenarios. For two of the scenarios Microsoft created test labs and published performance results with different SharePoint farm topologies. This is really handy indeed, because it paints a much better lead indicator scenario. We get to see what bottlenecks occurred as the load on the farm was increased. We also get insight about what Microsoft did to alleviate the bottlenecks and what sort of a difference it made.

The first lab test was based off Microsoft’s own Enterprise Intranet Collaboration Environment (the 2nd scenario above) and is covered on pages 144-162 of the capacity planning guide. The other lab was based off the Departmental collaboration environment (the 3rd scenario above) and is the focus of attention on pages 174-197. Consult the guide for the full detail of the tests. This is just a quick synthesis.

Lab 1: Enterprise Intranet Collaboration Environment

In this lab, Microsoft took a subset of the data from their production environment using different hardware. They acknowledge that the test results will be affected by this, but in my view it is not a show stopper if you take a lead indicator viewpoint. Microsoft tested web server scale out initially by starting out with a 3 server topology (one web front end server, one application server and one database server). They then increased the load on the farm until they reached a saturation point. Once this happened, they added an additional web server to see what would happen. This was repeated and scaled from one Web server (1x1x1) to five Web servers (5x1x1).

The transactional mix used for this testing was based on the breakdown of transactions from the live system. Little indication of read vs. write transactions is given in the case study, but on page 152 there is a detailed breakdown of SharePoint traffic by type. While I won’t detail everything here, regular old browser traffic was the most common, representing 36% of all test traffic. WebDAV came in second (WebDAV typically includes office clients and windows explorer view) representing 28.12% of traffic, with Outlook sync traffic third at 7.04%.

Below is a table showing the figures where things bottlenecked. Microsoft produce many graphs in their documentation so the figures below are an approximation based on my reading of them. It is also important to note that Microsoft did not perform tests while search was running, and compensated for search overhead by defining a max CPU limit for SQL Server of 80%.

| Metric | 1*1*1 | 2*1*1 | 3*1*1 | 4*1*1 | 5*1*1 |
|---|---|---|---|---|---|
| Max RPS | 180 | 330 | 510 | 560 | 565 |
| Sustainable RPS | 115 | 210 | 305 | 390 | 380 |
| Latency | .3 | .2 | .35 | .2 | .2 |
| IOPS | 460 | 710 | 910 | 920 | 840 |
| WFE CPU | 96% | 89% | 89% | 76% | 58% |
| SQL CPU | 17% | 33% | 65% | 78% | 79% |

For what it’s worth, the sustainable RPS figure is based on the servers not being stressed (all servers having less than 50% CPU). Looking at the above figures, several things are apparent.

  1. The environment scaled up to four Web servers before the bottleneck changed to be CPU usage on the database server
  2. Once database server CPU hit its limits, RPS on the web servers suffered. Note that the RPS gain from 4*1*1 to 5*1*1 was negligible once SQL CPU was saturated.
  3. The addition of the fourth Web server had the least impact on scalability compared to the preceding three (RPS only increased from 510 to 560, which is much less than the gains from adding the previous web servers; see the quick calculation after this list). This suggests the SQL bottleneck hit somewhere between 3 and 4 web servers.
  4. The average latency was almost constant throughout the whole test, unaffected by the number of Web servers and throughput. This suggests that we never hit any disk IO bottlenecks.
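
To make the diminishing returns in point 3 concrete, here is a quick back-of-the-envelope sketch (Python, purely illustrative) using my approximate readings of Microsoft’s graphs from the table above:

```python
# Approximate max RPS read from Microsoft's graphs (see the table above),
# keyed by the number of web front end servers in the x*1*1 topology.
max_rps = {1: 180, 2: 330, 3: 510, 4: 560, 5: 565}

for n in range(2, 6):
    gain = max_rps[n] - max_rps[n - 1]
    print(f"Web server #{n} added roughly {gain} RPS")

# Prints gains of roughly 150, 180, 50 and 5 RPS: the marginal benefit of each
# extra web server collapses once SQL Server CPU becomes the bottleneck.
```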

Once Microsoft identified the point at which database server CPU was the bottleneck (4*1*1), they added an additional database server and then kept adding more web servers as they did previously. They split half the content databases onto one SQL server and half onto the other. It is important to note that the underlying disk infrastructure was unchanged, meaning that total disk performance capability was kept constant even though there were now two database servers. This allowed Microsoft to isolate server capability from disk capability. Here is what happened:

| Metric | 4*1*1 | 4*1*2 | 6*1*2 | 8*1*2 |
|---|---|---|---|---|
| RPS | 560 | 660 | 890 | 930 |
| Latency | .2 | .35 | .2 | .2 |
| IOPS | 910 | 1100 | 1350 | 1330 |
| WFE CPU | 76% | 87% | 78% | 61% |
| SQL CPU | 78% | 33% | 52% | 58% |

Here is what we can glean from these figures.

  1. Adding a second database server did not provide much additional RPS (560 to 660). This is because CPU utilization on the Web servers was high. In effect, the bottleneck shifted back to the web front end servers.
  2. With two database servers and eight web servers (8*1*2), the bottleneck became the disk infrastructure. (Note the IOPS at 6*1*2 is no better than 8*1*2).

So what can we conclude? From the figures shown above, it appears that you could reasonably expect (remember we are talking lead indicators here) that bottlenecks are likely to occur in the following order:

  1. Web Server CPU
  2. Database Server CPU
  3. Disk IOPS

It would be a stretch to suggest when each of these would happen because there are far too many variables to consider. But let’s now examine the second lab case study to see if this pattern is consistent.

Lab 2: Divisional Portal Environment

In this lab, Microsoft took a different approach from the lab we just examined. This time they did not concern themselves with IOPS (“we did not consider disk I/O as a limiting factor. It is assumed that an infinite number of spindles are available”). The aim this time was to determine at what point a SQL Server CPU bottleneck was encountered. Based on what I noted from the first lab test above, unless your disk infrastructure is particularly crap, SQL Server CPU should become a bottleneck before IOPS. However, one thing in common with the last lab test was that Microsoft factored in the effects of an ongoing search crawl by assuming 80% SQL Server CPU as the bottleneck indicator.

Much more detail was provided on the transaction breakdown for this lab. Pages 181 and 182 list transactions by type and, unlike the first lab, whether they are read or write. While it is hard to directly compare to lab 1, it appears that more traffic is oriented around document collaboration than in the first lab.

The basic methodology of this lab was to start off with a minimal farm configuration of a combined web/application server and one database server. Through multiple iterations, the test ended with a configuration of three Web servers, one application server and one database server. The table of results is below:

| Metric | 1*1 | 1*1*1 | 2*1*1 | 3*1*1 |
|---|---|---|---|---|
| RPS | 101 | 150 | 318 | 310 |
| Sustainable RPS | 75 | 99 | 191 | 242 |
| Latency | .81 | .85 | .6 | .8 |
| Users simulated | 125 | 150 | 200 | 226 |
| WFE CPU | 86% | 36% | 76% | 42% |
| APP CPU | NA | 41% | 46% | 44% |
| SQL CPU | 18% | 32% | 56% | 75% |

Here is what we can glean from these figures.

  1. Web Server CPU was the first bottleneck encountered.
  2. At a 3*1*1 configuration, SQL Server CPU became the bottleneck. In lab 1 it was somewhere between the 3rd and 4th web server.
  3. RPS, when CPU is taken into account, is fairly similar between each lab. For example, in the first lab, the 2*1*1 scenario RPS was 330. In this lab it was 318 and both had comparable CPU usage. The 1*1*1 tests had differing results (101 vs 180), but if you adjust for the reported CPU usage, things even up.
  4. With each additional Web server, increase in RPS was almost linear. We can extrapolate that as long as SQL Server is not bottlenecked, you can add more Web servers and an additional increase in RPS is possible.
  5. Latencies are not affected much as we approached bottleneck on SQL Server. Once again, the disk subsystem was never stressed.
  6. The previous assertion that bottlenecks are likely to occur in the order of Web Server CPU, Database Server CPU and then the disk subsystem appears to hold true.

Before we go any further, one important point that I have neglected to mention so far is that the figures above represent extremely undesirable operating conditions. Do you really want your web server and database server to be running at 85% CPU constantly? I think not. What you are seeing above are the upper limits, based on Microsoft’s testing. While this helps us understand maximum theoretical capacity, it does not make for a particularly scalable environment.

To account for the issue of reporting on max load, Microsoft defined what they termed a “green zone” of performance. This is a term to describe what “normal” load conditions look like (for example, less than 50% CPU), and they also provided RPS results for when the servers were in that zone. If you look closely in the above tables you will see those RPS figures; I labelled them “Sustainable RPS”.

In case you are wondering, the sustainable RPS for each of the scenarios works out at roughly 60-75% of the peak RPS reported.

Some Microsoft math…

In the second lab, Microsoft offers some advice on how to translate their results into our own deployments. They suggest determining a users to RPS ratio and then utilising the green zone RPS figures to estimate server requirements. This is best illustrated via their own example from lab 2. They state the following:

  • A divisional portal in Microsoft which supports around 8000 employees collaborating heavily, experiences an average RPS of 110.
  • That gives a Users to RPS ratio of ~72 (that is, 8000/110). That is: 72 users will amount to 1 RPS.
  • Using this ratio and assuming the sustainable RPS figures from lab 2 results, Microsoft created the following table (page 196) to suggest the number of users a typical deployment might support.

image
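
To make that arithmetic concrete, here is a minimal sketch (Python, purely illustrative) of the same calculation, reusing the 8000 user / 110 RPS figures quoted above and the sustainable RPS results from the lab 2 table:

```python
# Microsoft's divisional portal figures quoted above.
users = 8000
average_rps = 110
ratio = users / average_rps          # roughly 72 users per 1 RPS
print(f"Users to RPS ratio: {ratio:.1f}")

# Sustainable ("green zone") RPS per topology from the lab 2 results table.
sustainable_rps = {"1*1": 75, "1*1*1": 99, "2*1*1": 191, "3*1*1": 242}

for topology, rps in sustainable_rps.items():
    # A topology that can sustain a given RPS should support roughly ratio * RPS users.
    print(f"{topology}: roughly {rps * ratio:,.0f} users")
```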

A basic performance planning methodology…

Okay.. so I am done… I have no more topics that I want to cover (although I could go on forever on this stuff). Hopefully I have laid out enough conceptual scaffolding to allow you to read Microsoft’s large and complex SharePoint performance and capacity planning documentation with more clarity than before. If I were to sum up a few of the key points of this 11 part exploration into the weird and wonderful world of SharePoint performance management, it would be as follows:

  1. Part 1: Think of performance in terms of lead and lag indicators. You will have less of a brain fart when trying to make sense of everything.
  2. Part 2: Requests are often confused with transactions. A transaction (eg “save this document”) usually consists of multiple requests and the number of requests is not an indicator of performance. Look to RPS to help here…
  3. Part 3 and 4: The key to utilising RPS is to understand that as a counter on its own, it is not overly helpful. BUT it is the one metric that you probably have available in lots of detail, due to it being captured in web server logs over time. Use it to understand usage patterns of your sites and portals and determine peak usage and concurrent usage.
  4. Part 5: Latency (and disk latency in particular) is both unavoidable, yet one of the most common root causes of performance issues. Understanding it is critical.
  5. Part 6: Disk latency affects – and is affected by – IOPS, IO size and IO patterns. Focusing on one without the others is quite pointless. They all affect each other, so watch out when they are specified in isolation (i.e. 5000 IOPS).
  6. Part 6, 7 and 8:  Latency and IOPS are handy in the  sense that they can be easily simulated and are therefore useful lead indicators. Test all SQL IO scenarios at 8k and 64K IO size and ensure it meets latency requirements.
  7. Part 9: Give your SAN dudes a specified IOPS, IO Size and latency target. Let them figure out the disk configuration that is needed to accommodate. If they can make your target then focus on other bottleneck areas.
  8. Part 10: Process Monitor and Windows Performance Analyser are brilliant tools for understanding disk IO patterns (among other things)
  9. Part 9 and 11: Don’t believe everything you read. Utilise Microsoft’s real world and lab results as a guide but validate expected behaviour by testing your own environment and look for gaps between what is expected and what you get.
  10. Part 11: In general, Web Server CPU will bottleneck first, followed by SQL Server CPU. If you followed the advice of points 6 and 7 above, then disk shouldn’t  be a problem.

Now I promised at the very start of this series, that I would provide you with a lightweight methodology for estimating SharePoint performance requirements. So assuming you have read this entire series and understand the implications, here goes nothing…

  • 1. Work out your aspirational disk performance targets. In the absence of anything better, use the figures from part 9: 5000 IOPS of random 64KB read/write IO at an average latency of 8ms (and no more than 20ms), plus sequential 64KB log IO at an average latency of 1ms (and no more than 5ms).
  • 2. Give these targets to your storage guys and let them worry about how to meet them.
  • 3. Ask them to verify the result with SQLIO (as covered in parts 7 and 8) rather than take the SAN vendor’s word for it.

If they can meet this target, skip to step 8.  🙂

If they cannot meet this, don’t worry because there are two benefits gained already. First, by finding that they cannot get near the above figures, they will do some optimisation like testing different stripe sizes and checking some other common disk performance hiccups. This means they now better understand the disk performance patterns and are thinking in terms of lead indicators. The second benefit is that you can avoid tedious, detailed discussions on what RAID level to go with up front.

So while all of this is happening, do some more recon…

  • 4. Examine Microsoft and HP’s testing results that I covered in part 9 and in this article. Pay particular attention to the concurrent users and RPS figures. Also note the IOPS results from Microsoft and HP testing. To remind you, no test ever came in over 1400 IOPS.
  • 5. Use logparser to examine your own logs to understand usage patterns. Not only should you eke out metrics like max concurrent users and RPS figures, but also examine peak times of the day, RPS growth rate over time, and what client applications or devices are being used to access your portal or intranet (a rough sketch of this sort of log crunching follows after this list).
  • 6. Compare your peak and concurrent usage stats to Microsoft and HP’s findings. Are you half their size, double their size? This can give you some insight on a lower IOPS target to use. If you have 200 simultaneous users, then you can derive a more modest target IOPS for your storage guys to meet that is in line with your own organisation’s size and make-up.
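
As a rough illustration of steps 5 and 6, the sketch below shows the kind of log crunching you could do in Python if Log Parser or the SharePoint Flavored Weblog Reader are not handy. The log file name is hypothetical and the script assumes standard W3C IIS logs with a #Fields header; treat it as a starting point, not a finished tool.

```python
from collections import defaultdict

requests_per_second = defaultdict(int)   # one-second buckets for RPS
users_per_hour = defaultdict(set)        # distinct users seen per hour
field_index = {}

with open("u_ex120701.log") as log:                      # hypothetical IIS log file
    for line in log:
        if line.startswith("#Fields:"):                  # W3C header names the columns
            names = line.split()[1:]
            field_index = {name: i for i, name in enumerate(names)}
            continue
        if line.startswith("#") or not field_index:
            continue
        fields = line.split()
        stamp = f"{fields[field_index['date']]} {fields[field_index['time']]}"
        # Prefer the authenticated user name; fall back to client IP if absent.
        user_field = "cs-username" if "cs-username" in field_index else "c-ip"
        user = fields[field_index[user_field]]
        requests_per_second[stamp] += 1
        users_per_hour[stamp[:13]].add(user)              # "YYYY-MM-DD HH" buckets

print("Peak RPS:", max(requests_per_second.values()))
for hour, users in sorted(users_per_hour.items()):
    print(hour, "distinct users:", len(users))
```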

By now the storage guys will come back to you in shock because they cannot get near your 5000 IOPS requirement. Be warned though… they might ask you to sign a cheque to upgrade the storage subsystem to meet this requirement. It won’t be coming out of their budget for sure!

  • 7. Tell them to slowly reduce the IOPS until they hit the 8ms and 1ms latency targets and give them the revised target based on the calculation you made in step 6. If they still cannot make this, then sign the damn cheque!

At this point we have assumed that there is enough assurance that the disk infrastructure is sound. Now it’s all about CPU and memory.

  • 8. Determine a users to RPS ratio by dividing your total user base by average RPS (based on your findings from step 5).
  • 9.  Look at Microsoft’s published table (page 196 of the capacity planning guide and reproduced here just above this conclusion). See what it suggests for the minimum topology that should be needed for deployment.
  • 10. Use that as a baseline and now start to consider redundancy, load balancing and all of that other fun stuff.

And there you have it! My 10 step dodgy performance analysis method. 🙂

Conclusion and where to go next…

Right! Although I am done with this topic area, there are some next steps to consider.

Remember that this entire series is predicated on the notion that you are in the planning stage. Let’s say you have come up with a suggested topology, deployed the hardware and developed your SharePoint masterpiece. The job of ensuring performance meets expectations does not stop here. You should still consider load testing to ensure that the deployed topology meets expectations and validates the lead indicators. There is also a seemingly endless number of optimisations that can be done within SharePoint too, such as caching to reduce SQL Server load, or tuning web application and service application settings.

But for now, I hope that this series has met my stated goal of making this topic area that little bit more accessible, and thank you so much for taking the time to read it all.

 

Paul Culmsee

www.hereticsguidebooks.com

www.sevensigma.com.au



Demystifying SharePoint Performance Management Part 10 – More tools of the trade…

Hi all and welcome to the tenth article in my series on demystifying SharePoint performance management. I do feel that we are getting toward the home stretch here. If you go way back to Part 1, I stated my intent to highlight some common misconceptions and traps for younger players in the area of SharePoint performance management, while demonstrating a better way to think about measuring SharePoint performance (i.e. lead and lag indicators). While doing so, we examined the common performance indicators of RPS, IOPS, MBPS, latency and the tools and approaches to measuring and using them.

I started the series praising some of Microsoft’s material, namely the “Planning guide for server farms and environments for Microsoft SharePoint Server 2010”, “Capacity Planning for Microsoft SharePoint Server 2010” and “Analysing Microsoft SharePoint Products and Technologies Usage” guides. But they are not perfect by any stretch, and in the last post, I covered some of the inconsistencies and questionable information that exists in the capacity planning guide in particular. Not only are some of the quoted disk performance figures given without the critical context needed to understand how to measure them in a meaningful way, some of the figures themselves are highly questionable.

I therefore concluded Part 9 by advising readers not to believe everything presented and always verify espoused reality with actual reality via testing and measurement.

Along the journey that we have undertaken, we have examined some of the tools that are available to perform such testing and measurement. So far, we have used Log Parser, SharePoint Flavored Weblog Reader, Windows Performance Monitor, SQLIO and the odd bit of PowerShell thrown in for good measure. This article will round things out by showing you two additional tools to verify theoretical fiction with hard cold reality. Both of these tools allow you to get a really good sense of IO patterns in particular (although they both have many other purposes). The first will be familiar to my more nerdy readers, but the second, while highly powerful, is much less well known to newbies and seasoned IT pros alike.

So without further ado, let’s meet our tools… Process Monitor and the Windows Performance Analysis Toolkit.

Process Monitor

Our first tool is Process Monitor, also commonly known as Procmon. Now this tool is quite well known, so I will not be particularly verbose with my examination of it. But for the three of you who have never heard of this tool, Process Monitor allows us to (among many other things) monitor accesses to the file system by processes running on a server. This allows us to get a really low level view of IO requests as they happen in real time. What is really nice about Process Monitor is its granularity. It lets you set up some sophisticated filtering so you can really see the wood for the trees. For example, you can create fairly elaborate filters that show just the details of a specific SQL database. Also handy is that all collected data can be saved to file for later examination.

When you start Process Monitor, it immediately starts monitoring file system, registry and process activity, and you will see a screen something like the one below (there are around 140 monitorable operations in all, covering file system, registry, process, network and kernel events). The default columns that are displayed include:

  • the name of the process performing the operation
  • the operation itself
  • the path to the object the operation was performed on
  • (and most importantly), a detail column, that tells you the good stuff.

image

The best way to learn Process Monitor is by example, so let’s use it to collect SQL Server IO patterns on SharePoint databases while performing a full crawl in SharePoint (and excluding writes to transaction logs). It will be interesting to see the range of IO request sizes during this time. To achieve this, we need to set up the filters for procmon to give us just what we need…

First up,  choose “Filter…” from the Filter menu.

image

In the top left column, choose “Process Name” from the list of columns. Leave the condition field as “is” and click on the drop down next to it. It will enumerate the running processes, allowing you to scroll down and find sqlservr.exe.

image   image

Click OK and your newly minted filter will be added to the list (check out the green include filter below). Now we will only see operations performed by SQL Server in the Process Monitor display.

image

Rather than give you a dose of screenshot hell, I will not individually show you how to add each filter, as the process is similar to what we just did to include only SQLSERVR.EXE. In all, we have to apply several more filters. The first two narrow the operations down to reading from and writing to the disk.

  • Filter on: Operation
  • Condition: Is
  • Filter applied: ReadFile
  • Filter type: Include
  • Filter on: Operation
  • Condition: Is
  • Filter applied: WriteFile
  • Filter type: Include

Now we need to specify the database(s) that we are interested in. On my test server, SharePoint databases  are on a disk array mounted as D:\ drive. So I add the following filter:

  • Filter on: Path
  • Condition: Contains
  • Filter applied: D:\DATA\MSSQL
  • Filter type: Include

Finally, we want to exclude writes to transaction logs. Since all transaction logs write to files with an .LDF extension, we can use an exclusion rule:

  • Filter on: Path
  • Condition: Contains
  • Filter applied: LDF
  • Filter type: Exclude

Okay, so we have our filters set. Now widen the detail column that I mentioned earlier. If you have captured some entries, you should see the word “Length :” with a number next to it. This is reporting the size of the IO request in bytes. Divide by 1024 if you want to get to kilobytes (KB). Below you can see a range of 1.5KB to 32KB.

image

At this point you are all set. Go to SharePoint central administration and find the search service application. Start a full crawl and fairly quickly, you should see matching disk IO operations displayed in Process Monitor. When the crawl is finished, you can choose to stop capturing and save the resulting capture to file. Process Monitor supports CSV format, which makes it easy to import into Excel as shown below (in the example below I created a formula for the column called “IO Size”).

image

By the way, in my quick analysis of the disk IO captured during part of a full crawl, I captured 329 requests that were broken down as follows (a sketch of how to produce this sort of breakdown from the exported CSV appears after the list):

  • 142 IO requests (42% of total) were 8KB in size for a total of 1136KB
  • 48 IO requests (15% in total) were 16KB in size for a total of 768KB
  • 48 IO requests (15% in total) were >16KB to 32KB in size for a total of  1136KB
  • 49 IO requests (15% in total) were >32KB to 64KB in size for a total of 2552KB
  • 22 IO requests (7% in total) were >64KB to 128KB in size for a total of 2104KB
  • 20 IO requests (6% in total) were >128KB to 256KB in size for a total of 3904KB
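
If you would rather not do the bucketing by hand in Excel, here is an illustrative sketch (Python) that reads a Process Monitor CSV export and tallies IO request sizes from the Detail column. The file name is hypothetical and the parsing assumes Procmon’s usual “Length: n” detail format, so treat it as a starting point rather than a polished tool.

```python
import csv
import re
from collections import Counter

# Size buckets roughly matching the breakdown above (upper bound in KB, label).
buckets = [(8, "<=8KB"), (16, ">8KB-16KB"), (32, ">16KB-32KB"),
           (64, ">32KB-64KB"), (128, ">64KB-128KB"), (256, ">128KB-256KB")]
counts = Counter()

with open("procmon-crawl.csv", newline="") as f:        # hypothetical Procmon export
    for row in csv.DictReader(f):
        # The Detail column contains text like "Offset: 1,234, Length: 8,192, ..."
        match = re.search(r"Length:\s*([\d,]+)", row.get("Detail", ""))
        if not match:
            continue
        size_kb = int(match.group(1).replace(",", "")) / 1024
        label = next((name for limit, name in buckets if size_kb <= limit), ">256KB")
        counts[label] += 1

total = sum(counts.values())
for label, count in counts.most_common():
    print(f"{label}: {count} requests ({count / total:.0%})")
```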

Windows Performance Analyser (with a little help from Xperf123)

Allow me to introduce you to one of the best tools you never knew you needed. Windows Performance Analyser (WPA) is a newer addition to the armoury of tools for performance analysis and capacity planning. In short, it takes the idea of Windows Performance Monitor to a whole new level. WPA comes as part of a broader suite of tools collectively known as the Windows Performance Toolkit (WPT). Microsoft describes the toolkit as:

“…designed for analysis of a wide range of performance problems including application start times, boot issues, deferred procedure calls and interrupt activity (DPCs and ISRs), system responsiveness issues, application resource usage, and interrupt storms.”

If most of those terms sounded scary to you, then it should be clear that WPA is a pretty serious tool and has utility for many things, going way beyond our narrow focus of Disk performance. But never fear BA’s, I am not going to take a deep dive approach to explaining this tool. Instead I am going to outline the quickest and simplest way to leverage WPA for examining disk IO patterns. In fact, you should be able to follow what I outline here on your own PC if SharePoint is not conveniently located nearby.

Now this gem of a tool is not available as a separate download. It actually comes installed as part of the Microsoft Windows SDK for Windows 7 and .NET Framework 4. Admins fearing bloat on their servers can rest easy though, as you can choose just to install the WPT components as shown below…

image

By default, the Windows Performance Toolkit will install its goodies into the “C:\Program Files\Microsoft Windows Performance Toolkit” folder. So go ahead and install it now, since it can be installed onto any version of Windows past Vista. (I am sure that none of you at all are reading this article on an Apple device right? :-).

Assuming you have successfully installed WPT, I now want you to head on over to codeplex and download a little tool called Xperf123 and save it into the toolkit folder above. Xperf123 is a 3rd party tool that hides a lot of the complexity of WPA and thus is a useful starting point. The only thing to bear in mind is that Xperf123 is not part of WPA and is therefore not a necessity. If your inner tech geek wants to get to know the WPA commands better, then I highly recommend you read a comprehensive article written by Microsoft’s Robert Smith and published back in Feb 2012. The article is called “Analysing Storage Performance using the Windows Performance Analysis Toolkit” and it is an outstanding piece of work in this area.

So we are all set. Let’s run the same test as we did with Procmon earlier: I will start a trace on my test SharePoint server, run a full crawl and then look at the resulting IO patterns. Perform the following steps in sequence.

  1. Start Xperf123 from the WPT installation folder (run it as administrator).
  2. At the initial screen, click Next and then Next again at the screen displaying operating system details
  3. From the Select Trace Type dropdown, choose Disk  I/O and press Next
  4. Ensure that “Enable Perfmon” and “Use Circular Logging” are ticked, and optionally choose “Specify Output Directory”. Press Next
  5. Leave “Stackwalk” unticked and choose Next

image     image

image  image

Alrightie then… we are all set! Click the Start Capture button to start collecting the good stuff! Xperf123 will run the actual WPA command line trace utility (called xperf.exe if you really want to know). Now go to SharePoint central administration and, as we did with our Process Monitor test, start a full crawl. Wait till the crawl finishes and then, in Xperf123, click the Stop Capture button. A trace file will be saved in the WPT installation folder (or wherever you specified). The naming convention will be based on the server name and the date the trace was run.

image  image

image

Okay, so capturing the trace was positively easy. How about analysing it visually?

Double click on the newly minted trace file and it will be loaded into the Performance Analyser analysis tool (this tool is also available from the Start menu). When the tool loads and processes the trace file, you will see CPU and Disk IO counts reported visually. The CPU is the line graph and IO counts are represented in a bar graph. Unlike Windows Performance Monitor, which we covered in Part 7, this tool has a much better ability to drill down into details.

If you look closely below there are two “flyout” arrows. One is on the left side in the middle of the screen and applies to all graphs and the other is on the top-right of each graph. If you click them, you are able to apply filters to what information is displayed. For example: if you click the “IO Counts” flyout, you can filter out which type of IO you want to see. Looking at the screenshot below, you can see that the majority of what was captured were disk writes (the blue bars below).

image  image

Okay, so let’s get a closer look at what is going on disk-wise. Right click somewhere on the Disk IO bar graph and choose “Detail Graph” from the menu.

image

Now we have a very different view. On the left we can see which disk we are looking at and on the right we can see detailed performance stats for that disk. It may not be clear from the screenshot, but the disk IO reported below is broken down by process. This detailed view also has flyouts and dropdowns that allow you to filter what you see. There is an upper-left dropdown menu under the section called “Physical Disk”. This allows you to change what disk you are interested in. On the upper right, there is a flyout labelled “Process Name”. Clicking this allows you to filter the display to only view a subset of the processes that were running at the time the trace was captured.

image

Now in my case, I only want to see the SQL Database activity, so I will make use of the aforementioned filtering capability. Below I show where I selected the disk where the database files reside and on the right I deselected all processes apart from SQLSERVR.EXE. Neat huh? Now we are looking at the graph of every individual IO operation performed during the time displayed and you can even hover over each dot to get more detail of the IO operation performed.

image

You can also zoom in with great granularity. Simply select a time period from the display by dragging the mouse pointer over the graph area. Right click the now selected time period and choose “Zoom to Selection”. Cool eh? If your mouse has a wheel button, you can zoom in and out by pressing the Ctrl key and rolling the mouse wheel.

image

Now even for the most wussy non-technical BA reading this, surely your inner nerd is liking what you see. But why stop here? After all, Process Monitor gave us lots more loving detail and had the ability to utilise sophisticated filtering. So how does WPA stack up?

To answer this question, try these steps: right click on the detail graph and this time choose “Summary Table”. This allows us to view the IO data in even more detail.

image

image

Voila! We now have a list of every IO transaction performed during the sample period. Each line in the summary table represents a single I/O operation. The columns are movable and sortable as well. On that note, some of the more interesting ones for our purposes include (thanks to Robert Smith for the explanation of these):

  • IO Type: Read, Write, or Flush
  • Complete Time: Time of I/O completion in milliseconds, relative to start and stop of the current trace.
  • IO Time: The amount of time in milliseconds the I/O took to complete
  • Disk Service Time: The inferred amount of time (in microseconds) the IO operation has spent on the device (this one has caveats, check Robert Smith’s post for detail).
  • QD/I: Queue Depth of the disk, irrespective of partitions, at the time this I/O request was initialized
  • IO Size: Size of this I/O, in bytes.
  • Process Name: The name of the process that initiated this I/O.
  • Path: Path and file name, if known, that is the target of this I/O (in plain English, this essentially means the file name).

I have a lot of IO requests in this summary view, so let’s see how this baby can filter. First up, let’s look only at IO that was initiated by SQL Server. Right click on the “Process Name” column and choose “Filter To” –> “Search on Column…” In the resulting window, enter “SQLSERVR.EXE” in the “Find what:” textbox. Double check that the column name is set to “Process name” in the dropdown and click Filter.

image  image

You should now see only SQL IO traffic. So let’s drill down further still. This time I want to exclude IO transactions that are transaction log related. To do this, right click on the “Path Name” column and choose “Filter To” –> “Search on Column…” In the resulting window, enter “MDF” in the “Find what:” textbox. Double check that the column name is set to “Path name” in the dropdown and click Filter.

image image

Can you guess the effect? Only SQL Server database files will be displayed since they typically have a file extension of MDF.

In the screenshot below, I have then used the column sorting capability to look at the IO sizes. Neat huh?

image

Don’t forget Performance Monitor…

Just before we are done with the Windows Performance Analysis Toolkit, cast your mind back to the start of this walkthrough when we used Xperf123 to generate the trace. If you check back, you might recall there was a tickbox in the Xperf123 wizard called “Enable Perfmon”. Well, it turns out that Xperf123 has one final perk. While the WPA trace was being made, a Perfmon trace of broader system performance was also captured. These logs are located in the C:\PerfLogs\ directory and are saved in the native Windows Performance Monitor format. So just double click the file and watch the love…

image

How’s that for a handy added bonus. It is also worth mentioning that the Perfmon trace captured has a significant number of performance counters in the categories of Memory, PhysicalDisk, Processor and System.

Conclusion and coming next…

Well! That was a long post, but that was more because of verbose screenshots than anything else.

Both Process Monitor and Windows Performance Analyser are very useful tools for developing a better understanding of disk IO patterns. While Procmon has more sophisticated filtering capabilities, WPA trumps Procmon in terms of reduced overhead (apparently 20,000 events per second costs less than 2% CPU on a 2.0 GHz processor). WPA also has the ability to visualise and drill down into the data better than Procmon can.

Nevertheless, both tools have far more utility beyond the basic scenarios outlined in this series and are definitely worth investigating more.

In the next and I suspect final post, I will round off this examination of performance by making a few more general SharePoint performance recommendations and outlining a lightweight methodology that you can use for your own assessments.

Until then, thanks for reading…

Paul Culmsee

www.hereticsguidebooks.com



Demystifying SharePoint Performance Management Part 9 – Don’t believe everything you R/W

Hi and welcome to Part 9 (bloody hell… nine!) of my series on trying to demystify SharePoint performance management a bit. If by any chance you have been asked to provide some sizing information for your organisation and you are finding the resources online a bit overwhelming, this series is for you. If you have been a part of our varied journey so far, the last few posts have been all about disk IO performance in the form of latency, IOPS and MBPS. In the last two articles, we have been learning about the different IO patterns that SQL Server is likely to utilise, as well as using the jackhammer known as SQLIO to simulate those IO patterns on unsuspecting disk infrastructure.

Now just to set the scene for this post (and conveniently perform some product placement), I recently published a book called “The Heretics Guide to Best Practices”. Now being the author and all, I am going to suggest you buy it because it is a completely riveting read! :-).

Now apart from blatant product placement, the real reason I mention it is because one of the chapters is called “Myths, Memes and Methodologies”. In it, we examine why some ideas gain legitimacy, even though they are based on often completely dodgy foundations. I mention this here, because in terms of SQL disk IO sizing, something similar has happened with Microsoft’s published material on the topic. So the focus of this article is to finish off our discussion on understanding disk IO patterns, while lifting the lid on some of the inconsistencies in the material that end up being repeated by SharePoint consultants as gospel to their unsuspecting disciples.

Now harking way back to part 1 to the notion of lead vs. lag indicators, our use of SQLIO thus far has essentially been used as a lead indicator. While SQLIO puts a real load on disk infrastructure and faithfully reports the resulting IOPS, latency and MBPS, the reality is it can never truly capture the nuances of a production SharePoint farm doing its thing. But in terms of a lead indicator that is okay. After all, a lead indicator by definition cannot guarantee an outcome. It can merely suggest that an outcome should be able to be met.

So while we are thinking about the lead indicator world view, some of you might have noticed that I have not yet made any suggestions about the minimum conditions of satisfaction for disk infrastructure used to underpin SharePoint. This has been deliberate until now, because I felt that it was critical to first understand the relationship between the size of a disk IO operation and its effect on IOPS, latency and MBPS. To that end, hopefully I have instilled a reflex in you where – if you are given an arbitrary latency, IOPS or MBPS figure that you have to meet – you immediately ask questions like, “What sort of IO patterns?” or “how large is the IO request typically going to be?” or “is the IO random or sequential?”

When whitepapers mislead…

Now we are about to get into one area where Microsoft’s published documentation is quite weak. Remember the 367 page “Capacity Planning for Microsoft SharePoint Server 2010” whitepaper? Starting at page 326, there is a section with the promising title of “Estimate Core Storage and IOPS needs” (this topic is also available separately as a technet article too). The problem is that despite that title, very little IOPS guidance is actually given. Instead, the content in the section overwhelmingly speaks about estimating storage requirements. In fact, the best you get is one explicit mention of IOPS in relation to the SharePoint Search service application, which states the following:

The IOPS requirements for Search are significant.

  • For the Crawl database, search requires from 3,500 to 7,000 IOPS.
  • For the Property database, search requires 2,000 IOPS.

Note: For the purpose of the rest of this article, let’s add the above figures together and simply say between 5,500 and 9,000 IOPS for search.

Do you see the problem here? This is simply an arbitrary IOPS figure with no guidance as to the IO patterns underpinning it. What about latency or the IO request size that you need to assume? Unfortunately, no guidance is given for these questions which makes this quoted figure not overly helpful. Plus, as you will soon see below, Microsoft seemingly contradict themselves elsewhere in the same whitepaper…

So what are good numbers to use?

In the absence of any hard data, the best way to deal with storage requirements is to think in terms of lead indicators. Indicators, from a lead point of view, can be framed as targets – something to aim for. Targets can then be broken down into different categories ranging from “cover your arse” to “above and beyond”:

  • Aspired target: The “this would be bloody fantastic if we could get there” target.
  • Agreed target: The “this is what we are going to deliver no matter what” target.
  • Minimum Condition of Satisfaction (MCOS) target: The “If we don’t achieve this we may as well pack up and go home” target.

So given these sorts of targets, what should the disk IO performance targets for SharePoint be? To work this out, we can utilise information already out there. Well… that is, we could if the information out there wasn’t so disparate and disconnected. So unfortunately, it takes some digging before you can find what you need.

Our first port of call in this regard is indeed Microsoft and the very same capacity planning and configuration guide that I criticised earlier for poorly dealing with IOPS. Hidden in the bowels of that document, the following statement is made on page 334 (emphasis mine):

Any storage architecture must support your availability needs and perform adequately in IOPS and latency. To be supported, the system must consistently return the first byte of data within 20 milliseconds.

So the way I look at it, a 20ms latency should be our MCOS target (see the explanation above for MCOS). If we consistently do worse than this, then we do not have a lot of assurance about the scalability of the disk IO subsystem being used for SharePoint. But like the arbitrary IOPS figure quoted in the previous section, I wonder if readers have spotted the problem with specifying this latency figure alone?

In both cases, don’t forget the almost symbiotic relationship between IO size, IOPS and latency. If we assumed that all IO operations were small (for example SQL’s page size of 8KB) then we could likely stay way under the 20ms limit with a more modest disk infrastructure. But to sustain the same latency with a larger IO size would require a faster disk subsystem. Why? Well, as we discussed in part 6, if the size of the IO writes is larger, such as 64KB, then latency will go up because servicing larger requests takes longer than smaller ones. Therefore, if we were to assume a larger IO size, we would need more/faster disks to be able to meet the same 20ms latency KPI.

So what disk IO size should we assume to give context to a latency figure? Some insight can be found back in part 6, when we examined SQL IO characteristics and established that it will likely be much more varied than SQL’s base IO unit of 8KB pages. My suggestion therefore, is to test 8KB but also ensure that 64KB can meet the latency target. This is because 64KB represents a reasonable average size between the 8KB to 256KB range most SQL Server’s IO operations will fall within. Thus, if a SQLIO test using random read/writes at 64K indicates more than 20ms latency consistently, then you should probably ask your storage people to take another look at it.

By the way, if you really want to give your storage guys a challenge, keep jacking up the IO size!

What about aspired latency targets?

So if you are cool with the notion that the minimum condition of satisfaction for a random IO test using 64K size should be less than 20ms latency, what about aiming higher with agreed or aspirational targets?

Luckily for all of us, we can once again stand on the shoulders of giants. In this case, Bob Duffy indirectly answers this question by providing what he considers to be the indicators for optimal SQL Server performance in general. In an excellent article with the rather appropriate title of “How to Specify SQL Storage Requirements to your SAN Dude”, Bob makes the following recommendations:

  • SQL Data files must have a response time averaging about 8ms and a maximum response time of around 20ms, using 64k size IOs that are random in nature
  • SQL Log Files must have a write response time averaging from 1-5ms, using 64k IO size that is sequential in nature

The nice thing about specifying a target or benchmark like this, is that you are able to sidestep discussions on RAID levels, stripe sizes and many other things that SAN nerds find interesting. We keep things focused on the lead indicators and in effect state “If you can meet these figures, configure it any way you like.” This gives the SAN guys the freedom to do their job, while giving you an indicator that can give you confidence in the disk infrastructure. So if we were to distil the figures above into lead indicator targets for storage gurus, it might look something like this:

  • MCOS target: Less than 20ms latency for random IO requests of 64KB
  • Agreed target: Average 8ms latency for random IO requests of 64KB with no more than 20ms max latency. Less than 5ms latency for sequential log IO
  • Aspired target: No more than 8ms latency for random SQL IO requests of 64KB and average of 1ms latency sequential log IO with max never going above 5ms

Now in the ProData article, Bob made a slightly tongue in cheek point that sums up the above thinking really well, as well as giving insight to a critical aspect we have not considered so far…

Nowadays most SQL consultants try and not talk about RAID types and types of disk, it can be best to leave that up to the storage guys. If the storage team can meet my requirement for 5,000 random 64k read/write IOPs at 8ms latency by using 50 old SATA drives at 5,400 rpm in RAID 5 then knock yourself out – I’m happy. Well maybe I’m happy till we have that chat about Service Level requirements during a disk degrade event but that’s a different story…

If you look closely at Bob’s quote, you will see that he has also specified the last critical variable in the mix. Bob’s mention of “5000 random 64k read/write IOPS” is in reference to another point he makes. Without an IOPS figure to work from, the targets we have come up with are effectively meaningless. Quoting Bob:

The main thing to specify apart form your latency requirement is the throughput (IOPs). It is no good meeting the 8ms target for 100 IOPs and then finding your workloads needs 5,000 IOPs. You wont be able to meet the 8ms target!!

Consider it this way… a SharePoint site that services 100,000 users will process a lot more IO requests than a site that services 10 users. With the latter, it is quite likely that the latency targets we have been talking about (even the aspirational ones) would be pretty easy to meet with a single disk. (To hark back to our shopping centre metaphor, one check out operator is all that is needed at a corner store, whereas many are needed at the supermarket). This is presumably why Bob has used a figure like 5000 IOPS for his post. It is probably a figure that conveniently represents some fairly heavy disk usage. But it does beg two questions:

  • How many IOPS should we use to simulate SharePoint load?
  • In the absence of anything else, is 5000 IOPS a good figure to go with?

Don’t believe all you read…

Now if you go back and read the start of this post, you will recall I mentioned that Microsoft stated some IOPS figures for the SharePoint search application databases ranged between 5,500 to 9000. That would indicate that Bob’s base figure of 5000 is a bit low, especially given that SharePoint has many other components beyond search that have not been taken into account. So to put Bob’s 5000 IOPS figure in perspective, let’s re-examine Microsoft’s trusty capacity planning whitepaper. One of the great things about this document is that Microsoft detailed the performance stats of a typical day in the life of their internal SharePoint environments. Since Microsoft are so large, they have different SharePoint farms for different collaborative scenarios. The scenarios they covered were:

  1. Enterprise Intranet environment (also described as published intranet). In this scenario, employees view content like news, technical articles, employee profiles, documentation, and training resources. It is also the place where all search queries are performed for all of the other SharePoint environments within the company.
  2. Enterprise intranet collaboration environment (also described as intranet collaboration). This is where important team sites and publishing portals are housed. They are typically used for enterprise collaboration, organizations, teams, and projects. Sites created in this environment are used as communication portals, applications for business solutions, and general collaboration. No custom code runs in these sites.
  3. Departmental Collaboration environment. In this scenario, employees use this environment to track projects, collaborate on documents, and share information within their department.
  4. Social Collaboration Environment. This is the My Sites scenario. These connect employees with one another and the information that they need. Employees use this environment to present personal information such as areas of expertise, past projects, and colleagues to the wider organization. The environment also hosts personal sites and documents for viewing, editing, and collaboration.

Now reading about these scenarios is highly interesting and Microsoft provides some nice nuggets of information that we will use in a future post. But for now I will stick purely to a disk IOPS perspective. To that end, below are a few fun-filled facts about the number of users in each of the four scenarios:

  1. Enterprise Intranet environment: 33580 unique users per day, with an average of 172 concurrent users and a peak concurrency of 376 users.
  2. Enterprise intranet collaboration environment: 69702 unique users per day, with an average of 420 concurrent users and a peak concurrency of 1433 users
  3. Departmental Collaboration environment. 9186 unique users per day, with an average of 189 concurrent users and a peak concurrency of 322 users
  4. Social Collaboration Environment. 69814 unique users per day, with an average of 639 concurrent users and a peak concurrency of 1186 users

So now you have a sense of the size of these scenarios and, as an added bonus, gotten a glimpse into the difference that usage patterns can make. For example: social collaboration and enterprise intranet collaboration have similar numbers of unique users, but social has higher average concurrency and a lower peak. But what about IOPS?

In the document, IOPS is split into reads per second and writes per second, so I added them to estimate IOPS. The results are rather surprising…

| Metric | Social Collaboration | Departmental Collaboration | Published intranet | Intranet Collaboration |
|---|---|---|---|---|
| Unique visitors | 69814 | 9186 | 33580 | 69702 |
| Average concurrent | 639 | 189 | 172 | 420 |
| Max concurrent | 1186 | 322 | 376 | 1433 |
| IOPS | 941 | 74 | 409.66 | 409.66 |

Now while it might be tempting to ponder why social collaboration has over double the IOPS, yet half the concurrency of enterprise intranet collaboration, we are not going to worry about that here. Besides, we actually covered some of it already when we used logparser to get insights into usage patterns. What I will instead do is draw your attention to the fact that none of the IOPS figures come anywhere near the 5000 IOPS cited by ProData or the 5,500-9,000 IOPS Microsoft cites for search (in the very same capacity planning document, I might add!)

So something is amiss. If an organisation the size of Microsoft can have almost 70000 unique users per day, with a peak concurrency of 1433 users, and total only 410 IOPS, then where the hell did the 5500-9000 IOPS figure for search alone come from? Even if you take the scenario with the highest IOPS (the social collaboration scenario with 941 IOPS), that’s still less than one fifth of the 5500 IOPS that was at the low end of the search IOPS figure.

Now I am also suspicious that two different case studies have the exact same IOPS figure. If you compare the “published intranet” scenario with the “intranet collaboration” scenario, one has half the visitors, yet both have precisely the same IOPS (right down to the decimal places). That seems highly unlikely to me and I suggest that a mistake has been made. Given that intranet collaboration has the highest max concurrency figure, I would have expected its IOPS to be higher than reported. Hmmm…

What can we take away from this? For one, the capacity planning document could seriously do with a rewrite in this area. Secondly, I don’t have a lot of faith in the IOPS figures quoted (although I have more confidence in the case studies than in the arbitrary figures specified for search).

So if we put aside the doubt created by the issues with the capacity planning guide, there is one really interesting fact that remains… none of the reported IOPS figures came anywhere near 5000 IOPS.

Insights from HP…

It turns out that Hewlett Packard also did some load testing of SharePoint 2010 (among other things) and published a whitepaper called the “HP performance and configuration guide for Microsoft SharePoint 2010”. In this guide, they detail the results of a scenario they tested based on what they termed an “Enterprise Workload”. The guide covers the definition of the enterprise workload in loving detail, but the gist of it is that it covers the following areas:

  • Document Center (30% of operations) – check-out, download, upload and check-in documents
  • Team Sites (20% of operations) – work with calendars, discussions and documents
  • Portal Sites (20% of operations) – work with events, announcements and surveys
  • My Sites (10% of operations) – work with documents in the personal documents library
  • Search (20% of operations) – submit searches with random words or phrases

HP then simulates 500 concurrent users performing the actions above. In Table 13 of the report (page 28 of their document and reproduced below), HP outlines the performance and even breaks down the IO characteristics of each SharePoint database (which is really handy indeed). Adding up the last column of transfers/sec (which is essentially IOPS) we get a result of 1347.33 IOPS.

Thus we are still considerably under the 5000 IOPS that Bob Duffy suggests.
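As an aside, if you want to build the same kind of per-database IOPS picture for your own farm, the per-database disk transfers/sec figures (however you collect them) just need summing. A trivial PowerShell sketch is below; the database names and numbers are placeholders of my own invention, not HP’s figures:

$transfersPerSec = @{
    'WSS_Content'     = 500   # placeholder values only - substitute your own measurements
    'Search_Crawl'    = 200
    'Search_Property' = 100
    'Usage_DB'        = 50
}
"Total IOPS across databases: {0}" -f ($transfersPerSec.Values | Measure-Object -Sum).Sum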

Conclusion…

Right! Remember our discussion above on minimum conditions of satisfaction (MCOS), agreed targets and aspirational targets? For an aspirational target, I think that we can reasonably use 5000 IOPS as a starting point for an enterprise organisation of Microsoft’s size. If we stick with 5000 IOPS, then my suggestion for an aspirational latency target would be:

  1. no more than 8ms latency for random SQL IO requests of 64KB
  2. average 1ms (and no more than 5ms max) latency of sequential log IO of 64KB

I think these figures are a pretty good test of a disk subsystem, and that Bob at ProData is therefore pretty close to the mark. Of course, you can use these figures to make your own judgement and adjust accordingly. Provided that you think of them as lead indicators that give you a level of confidence in your disk infrastructure, you now have the tools and knowhow to run the tests too.
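If you want to turn those two targets into an actual test, the SQLIO patterns below are one way to do it. This is a sketch only: the drive letters, test duration, thread count and queue depth are my own assumptions, so adjust them for your environment (and swap -kW for -kR if you want the read side as well). The first line hammers a data drive with random 64KB IO, the second hits a log drive with sequential 64KB IO; compare the reported average latency against the 8ms and 1ms targets above.

sqlio -kW -b64 -frandom -s120 -t2 -o8 -BH -LS F:\testfile.dat
sqlio -kW -b64 -fsequential -s120 -t2 -o8 -BH -LS L:\testfile.dat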

So if there was a moral of the story to this post, it would be to not believe everything you read and always verify espoused reality with actual reality via testing. On that note, the next post will finish off our examination of disk performance by going over 2 additional tools that I think are particularly good for testing assumptions. After that, we will be revisiting Microsoft’s case studies, as well as some findings, insights and recommendations from some additional lab scenarios that Microsoft conducted.

Thanks for reading

Paul Culmsee

www.sevensigma.com.au



Demystifying SharePoint Performance Management Part 8 – More on SQL and SQLIO

Well here we are at part 8 of my series on making SharePoint performance management that little bit easier to understand. What is interesting about this series is its timing. In case the marketing tsunami has somehow passed you by, the SharePoint 2013 public beta was released around the time I wrote this. Much is being made of its stated requirement of 24GB of RAM for a “Single server with a built-in database or single server that uses SQL Server”. While the reality is that requirements depend on which components you are working with, this series of articles should be just as useful in relation to SharePoint 2013 as for any other version.

Now, if you have been following events thus far, we have been spending some time examining disk performance, as that is a very common area where a sub-optimal configuration can result in a poor experience for users. In part 6, we looked at the relationship between the performance metrics of disk latency, IOPS and MBPS. We also touched on the IO characteristics (nerd speak for the manner in which something reads and writes to disk) of SQL Server and some SharePoint components. In the last post, we examined the Windows performance counters that one would use to quickly monitor latency and IOPS in particular. We then finished off by taking a toe dip into the coolness of the SQLIO utility, which is a great tool for stress testing your storage infrastructure by pulverising it with different IO read and write patterns.

In this post, we will spend a bit of time taking SQLIO to the next step and I will show you how you can run a comprehensive disk infrastructure stress test. Luckily for both of us, others have done the hard work and we can reap the benefits of their expertise and insights. First up however, I would like to kick things off by spending a little time showing you the relationship between SQLIO results and performance monitor counters. This helps to reinforce what the reported numbers mean.

Performance Monitor and SQLIO

In the previous post when we used Windows Performance Monitor, we plotted IOPS and latency by watching the counters as they occurred in real-time. While this is nice for a quick analysis, nothing is actually stored for later analysis. Fortunately, performance monitor has the capability to run a trace and collect a much larger data set for more detailed analysis later. So first up, let’s use performance monitor to collect disk performance data while we run a SQLIO stress test. After the test has been run, we will then review the trace data and validate it against the results that SQLIO reports.

So go ahead and start up performance monitor (and consult part 7 of this series if you are unsure of how to do this). Looking at the top left of the performance monitor window, you should see several options listed under “Performance”. Click on “Data Collector Sets” and look for a sub menu called “User Defined”. Now right click on “User Defined” and choose “New –> Data Collector Set” as shown below:

image

This will start a wizard that will ask us to define what performance counters we are interested in and how often to sample performance. I have pasted screenshots of the sequence below (click to enlarge any particular one). First up we need to give a name to this collection of counters and as you can see below, I called mine “Disk IO Experiments”. Once we have given it a name, we have to choose the type of performance data we want to collect. Tick the “Performance counter” button and ensure the others are left unticked.

image  image

Next we need to pick the specific counters we want. We will use the same counters that we used in part 7, with the addition of two more. To remind you of part 7, the counters we looked at were:

  • Avg. Disk sec/Read   – (measures latency by looking at how long in seconds, a read of data from the disk took)
  • Avg. Disk sec/Write  –  (measures latency by looking at how long in seconds, a write of data to the disk took)
  • Disk Reads/sec  –  (measures IOPS by looking at the rate of read operations on the disk per second)
  • Disk Writes/sec  – (measures IOPS by looking at the rate of write operations on the disk per second)

In addition to these counters, we will also add two more to the collector set

  • Avg. Disk Bytes/Read – (Measures the size of each read request by reporting the number of bytes each one used)
  • Avg. Disk Bytes/Write – (Measures the size of each write request by reporting the number of bytes each one used)

We will use these counters to see if the size of the IO request that SQLIO uses is reported correctly.

Depending on your configuration, choose the PhysicalDisk or LogicalDisk performance object (consult part 7 for the difference between PhysicalDisk and LogicalDisk). You will then find the performance counters I listed above. Before you do anything else, make sure that you pick the right disk or partition from the “Instances of selected object” section. We need to specifically pick the disk or partition that SQLIO is going to stress test. Now select each of the aforementioned six performance counters and click the “Add” button. Finally, make sure that you set the sample interval to 1 second as shown below. This is really important because it makes it easy to compare to SQLIO, which reports on a per second basis.

image  image

At this point you do not need to configure anything else, so click the “Finish” button, rather than the “next” button, and the collector should now be ready to go. It will not start by default, but since there is no fun in that, let’s collect some data. Right click on your shiny new data collector set and choose “Start”.

image  image

Once started, performance monitor is collecting the values of the six counters each and every second. Now let’s run a SQLIO command to give it something to measure. In this example, I am going to run SQLIO with random 8KB writes. But to make it interesting, I will use two threads and simulate 8 outstanding IO requests in the queue. If you recall my grocery check-out metaphor of part 6, this is like having 8 people with full shopping carts waiting in line for a single check-out operator. Since the guy at the back of the line has to wait for the seven people in front of him to be processed, he has to wait longer. So with eight outstanding IO requests, latency should increase as each IO request will be sitting in a queue behind the seven other requests.

By the way, if none of that made sense, then you did not read part 6 and part 7. I urge you to read them before continuing here, because I am assuming prior knowledge of SQLIO, disk latency characteristics and the big trolley theory.

Here is the SQLIO command and below is the result…

SQLIO -kW -b8 -frandom -s120 -t2 -o8 -BH -LS F:\testfile.dat

image

Now take a note of the results reported. IOPS was 526, MBs/sec was 4.11 and, as expected, the average latency was much larger than in the SQLIO tests we ran in part 7. In this case, latency was 29 milliseconds.
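As a quick aside, it is worth noting how neatly those numbers hang together. With 2 threads each keeping 8 IOs outstanding, there are 16 requests in flight at any time, and 16 requests divided by 526 IOPS works out to roughly 30 milliseconds per request, right next to the 29ms that SQLIO reported. A throwaway PowerShell sketch of the arithmetic (a calculator works just as well):

$threads = 2
$outstanding = 8
$iops = 526
$inFlightIOs = $threads * $outstanding                 # 16 requests queued or in service at once
$estimatedLatencyMs = ($inFlightIOs / $iops) * 1000    # ~30ms, consistent with the 29ms SQLIO reported
"{0:N0} ms estimated average latency" -f $estimatedLatencyMs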

Let’s now compare this to what performance monitor captured. First up, return to Performance Monitor, and stop your data collector set by right clicking on it and choosing “Stop”. Now if you cast your eye to the top left navigation pane, you should see an option called “Reports” listed under “Performance”. Click on “Reports” and look for a sub menu called “User Defined”. Expand “User Defined” and hey presto! Your data collector set should be listed…

image  image

Expand the data collector set and you will find a report for the data you just collected. The naming convention is the server name and the date of the collection. Click on this and you will then see the performance data for that collection in the right pane. At the bottom you can see the six performance counters we chose and just by looking at the graph, you can clearly see when SQLIO started and stopped.

image

Now we have to do one additional step to make sure that we are comparing apples with apples. Performance monitor will calculate its averages based on the total time displayed. As you can see above, I did not run SQLIO straight away, but the performance counters were collected each second nonetheless. Therefore we have a heap of zero values that will bring the averages down and mislead. Fear not though, it is fairly easy (although not completely obvious) to zoom into the time we are interested in. If you look closely, just below the performance graph, where the time is reported, there is a sliding scale. If you click and drag the left and right boxes, you can highlight a specific time you are interested in. This will be shown in the performance graph too, so using this tool, we can get more specific about the time we are interested in. Then in the toolbar above the graph, you will see a zoom button. Click it and watch the magic…

image  image

As you can see below, now we are looking at the performance data for the period when the SQLIO was run. (Now it should be noted that windows performance monitor isn’t particularly granular here. I had to fiddle with the sliding scale a couple of times to accurately set the exact times when SQLIO was started and then stopped.)

image

Now let’s look at the results reported by performance monitor. The screenshot above is looking at the number of disk writes per second. Let’s zoom into the figures for the time period and examine the average result over the sample period. To save you squinting, I have pasted it below and called out the counter in question. Performance monitor has reported average “Disk Writes/Sec” as 525.423. This is entirely consistent with SQLIO’s reported IOPS of 526.

image

Latency (reported in seconds via the counter Avg. Disk sec/Write) is also fairly consistent with SQLIO. The figure from performance monitor was 0.03 seconds (30 milliseconds). SQLIO reported 29 milliseconds.

image

What about IO size? Well, that’s what Avg. Disk Bytes/Write is for… Let’s take a look shall we? Yup… 8192 bytes (in other words 8KB), which is exactly the parameter we specified.

image
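As an aside, if you would rather skip the data collector set GUI altogether, PowerShell’s Get-Counter cmdlet can do a rough version of the same capture. The sketch below is assumption-laden (the F: drive letter, the 120 second sample count and the idea of running it alongside a SQLIO test are all mine), but it collects the same six counters at one second intervals and averages them afterwards:

# Assumed instance: the F: partition being stress tested
$counters = "\LogicalDisk(F:)\Avg. Disk sec/Read",
            "\LogicalDisk(F:)\Avg. Disk sec/Write",
            "\LogicalDisk(F:)\Disk Reads/sec",
            "\LogicalDisk(F:)\Disk Writes/sec",
            "\LogicalDisk(F:)\Avg. Disk Bytes/Read",
            "\LogicalDisk(F:)\Avg. Disk Bytes/Write"

# Sample once per second for 120 seconds while SQLIO runs in another window
$samples = Get-Counter -Counter $counters -SampleInterval 1 -MaxSamples 120

# Average each counter across the run
$samples.CounterSamples |
    Group-Object Path |
    ForEach-Object {
        [pscustomobject]@{
            Counter = $_.Name
            Average = ($_.Group | Measure-Object CookedValue -Average).Average
        }
    }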

SQL IO characteristics revisited (and an awesome script)

Now that we understand what SQLIO is telling us via examining Windows performance monitor counters, I’d like to return to the topic of SQL IO patterns. Back at the end of part 6, I spent some time talking about SQL and SharePoint IO characteristics. As a quick recap, I mentioned that SQL reads from and writes to databases via 8KB pages. Now based on me telling you that, you might assume that if you had to open a large document from SharePoint (say 1MB or 1024KB), SQL would make 128 IO requests of 8KB each.

While that would be a reasonable assumption, it’s also wrong. You see, I also mentioned that SQL Server has a read-ahead algorithm. This algorithm means that SQL will try and proactively retrieve data pages that are going to be used in the immediate future. As a result, even though a single page is only 8KB, it is not unusual to see SQL read data from disk in much larger chunks if it thinks the next few 8KB pages are likely to be asked for anyway. Now as an aside, if you are running SQL Enterprise edition, the possible read-ahead range is from 1 to 128 pages (other editions of SQL max out at 32 pages). Assuming SQL Enterprise edition, this translates to between 8KB and 1024KB for a single IO operation. Think about this for a second… based on the 1MB document example I used in the previous paragraph, it is technically possible that this could be serviced with a single IO request by an enterprise edition of SQL Server.
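To put some rough numbers on that, here is the arithmetic as a tiny PowerShell sketch (the 1MB document is just the example from the paragraph above):

$documentKB = 1024          # the 1MB document from the example above
$pageKB = 8                 # SQL Server page size
$maxReadAheadKB = 1024      # enterprise edition read-ahead ceiling (128 pages)
"Without read-ahead: {0} separate 8KB IO requests" -f ($documentKB / $pageKB)
"With read-ahead   : as few as {0} IO request" -f [math]::Ceiling($documentKB / $maxReadAheadKB)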

Okay, so essentially SQL has varying IO characteristics when it comes to reading from and writing to databases. But there is still more to it. This is because there are a myriad of SQL IO operations that we did not even consider in part 6. As an example, we have not spoken about the IO characteristics of how SQL writes to transaction logs (which is sequential as opposed to random IO, and does not use 8KB pages at all). Another little known fact with transaction logs is that SQL has to wait for them to be “flushed to stable media” before the data page itself can be flushed to stable media. This is known as Write Ahead Logging and is used for data integrity purposes. What it means though is that if logging has a lot of latency, the rest of SQL Server can potentially suffer as well (and if it was not obvious before, this is yet another good reason why people recommend putting SQL data and log files on different disks).

Now I am not going to delve deep into SQL IO patterns any more than this, because we are now getting into serious nerdy territory. However what I will say is this: by understanding the characteristics of these IO patterns, we have the opportunity to change the parameters we pass to SQLIO and more accurately reflect real-world SQL characteristics in our testing. Luckily for all of us, others have already done the hard work in this area. First up, Bob Duffy created a table that summarises SQL Server IO patterns based on the type of operations being performed. Even better than that… Niels Grove-Rasmussen wrote a completely brilliant post, where not only did he list the IO patterns that SQL is likely to exhibit, he wrote a PowerShell script that then runs 5 minute SQLIO simulations for each and every one of them!

I have not pasted the script here, but you will find it in Niels’ article. What I will say though is that aside from the obvious 8KB random reads and writes that we have concentrated on thus far, Niels listed several other common SQL IO patterns that his SQLIO script tests:

  • 1 KB sequential writes to the log file (small log writes)
  • 64 KB sequential writes to the log file (bulk log writes)
  • 8 KB random reads to the log file (rollbacks)
  • 64 KB sequential writes to the data files (checkpoints, reindex, bulk inserts)
  • 64 KB sequential reads to the data files (read-ahead, reindex, checkdb)
  • 128 KB sequential reads to the data files (read-ahead, reindex, checkdb)
  • 128 KB sequential writes to the data files (bulk inserts, reindex)
  • 256 KB sequential reads to the data files (read-ahead, reindex)
  • 1024 KB sequential reads to the data files (enterprise edition read-ahead)
  • 1 MB sequential reads to the data files (backups)

The script actually handles more combinations than those listed above because it also tests for differing numbers of threads (-t) and outstanding requests (-o). All in all, over 570 combinations of IO patterns are tested. Be warned here… given that each test takes 5 minutes to run by default, with a 60 second wait time in between each test, be prepared to give this script at least 2 days to let it run its course!

The script itself is dead simple to run. Just open a PowerShell window and save Niels’ script to the SQLIO installation folder. From there, change to that directory and issue the command:

./SQLIO_Batch.ps1

Then come back in 3 days! Seriously though, depending on your requirements, you can reduce the number of scenarios by editing the first 7 lines of the script, which are quite self explanatory:

$Drive = @('G', 'H', 'I', 'J')                    # Drives to test
$IO_Kind = @('W', 'R')                            # Write before read so that there is something to read.
$Threads = @(2, 4, 8)                             # Number of SQLIO threads (-t)
#$Threads = @(2, 4, 8, 16, 32, 64)
$Seconds = 5*60                                   # Duration of each test - five minutes
$Factor = @('random', 'sequential')               # Access pattern (-f)
$Outstanding = @(1, 2, 4, 8, 16, 32, 64, 128)     # Outstanding IO requests (-o)
$BlockSize = @(1, 8, 64, 128, 256, 1024)          # IO size in KB (-b)
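If you are wondering where the “over 570 combinations” figure comes from, it falls straight out of those parameter arrays: 2 IO kinds × 3 thread counts × 2 access patterns × 8 outstanding-request values × 6 block sizes gives 576 tests (I am assuming the drives are exercised within each test rather than multiplying the count, otherwise the runtime would not line up with the two-day estimate above). A quick sanity check of the runtime, assuming the default 5 minute test and 60 second pause:

$combinations = 2 * 3 * 2 * 8 * 6          # IO kind x threads x factor x outstanding x block size = 576
$minutesPerTest = 5 + 1                    # 5 minute test plus a 60 second pause
$totalHours = ($combinations * $minutesPerTest) / 60
"{0} combinations, roughly {1:N1} hours (about {2:N1} days)" -f $combinations, $totalHours, ($totalHours / 24)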

Now if this wasn’t cool enough, Niels has also written a second script that parses the output from all of the SQLIO tests. This can produce a CSV file that allows you to perform further analysis in Excel. To run this script, we need to know the name of the output file from the first script. By default the filename is SQLIO_Result.<date>.txt. For example:

./SQLIO-Parse.ps1 -ResultFileName 'SQLIO_Result.2010-12-24.txt'

By default the parse script outputs to the screen, but modifying it to write to CSV file is really easy. All one has to do is comment out the second last line of code and uncomment the last one as shown below:

#$Sqlio | Format-Table -Property Kind,Threads,Seconds,Drive,Stripe,Outstanding,Size,IOs,MBs,Latency_min,Latency_avg,Latency_max -AutoSize

$Sqlio | Export-Csv SQLIO_Parse.csv

Below is an example of the report in Excel. Neat eh?

image
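If Excel is not your thing, the same CSV can be sliced up in PowerShell just as easily. The property names below come straight from the Format-Table line in Niels’ parse script; the file name and the idea of ranking by average latency are just my suggestions:

# Show the ten IO patterns with the worst average latency
Import-Csv SQLIO_Parse.csv |
    Sort-Object { [double]$_.Latency_avg } -Descending |
    Select-Object Kind, Stripe, Threads, Outstanding, Size, IOs, MBs, Latency_avg -First 10 |
    Format-Table -AutoSize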

Conclusion and coming up next…

By now, you should be a SQLIO guru and have a much better idea of the sort of IO patterns that SQL Server exhibits beyond just reading from and writing to databases. We have covered the IO patterns of transaction logs, as well as examined a terrific PowerShell script that not only runs all of the IO scenarios you need, but parses the output to produce a CSV file for deeper analysis. In short, you now have the tools you need to run a pretty good disk infrastructure stress test and start some interesting conversations with your storage gurus.

However at this point I feel there are some pieces missing from this disk puzzle:

  1. We have not yet brought the discussion back to lead and lag indicators. So while we know how to hammer disk infrastructure, how can we be more proactive and specify minimum conditions of satisfaction for our disk infrastructure?
  2. Microsoft’s treatment of disk performance (and in particular IOPS and latency) in their performance documentation is inconsistent and, in my opinion, confuses more than it clarifies. So in the next post, we are going to look at these two issues. In doing so, we are going to leave SQLIO and Performance Monitor behind and examine two other utilities, including one that is lesser known but highly powerful.

Until then, thanks for reading

Paul Culmsee

www.hereticsguidebooks.com



Demystifying SharePoint Performance Management Part 7 – Getting at Latency, IOPS and MBPS

Hi all, and welcome to part 7 of this series on SharePoint performance planning. This is the point of the series where I realise that I have much more to write about than I intended. Last time this happened I never got around to finishing the series (*blush* … a certain tribute to a humble leave form ). Like that series, I now have no idea how many posts I will end up doing, but I will keep soldiering on nonetheless.

Recapping the last two posts of this series in particular, we have been looking at the relationship between the performance measures of Disk latency, Disk I/O per second (IOPS) and Disk Megabytes transferred per second (MBPS). We spent most of part 6 looking at the relationship between these three performance metrics by specifically focusing on how the size of an IO request affects things. If you recall, a couple of key points were made:

  • In general, the larger the IO request being made, the more latency there will be, resulting in less IOPS but increased MBPS.
  • Latency is significantly affected by whether an IO request is sequential or random. To demonstrate this, I used a tool called SQLIO to simulate disk IO and generate some performance stats, which showed both IOPS and MBPS improving by some 750% for sequential IO when compared to random IO.

We finished the post by examining the way SQL server performs IO requests and what SharePoint components are IOPS heavy. In short, SQL Server uses a range of request sizes for database reads and writes between 8K and 1024KB. The reason for the range (for reads anyhow) is the read-ahead algorithm (gory detail here), in which SQL attempts to proactively retrieve data that are going to be used in the immediate future. A read-ahead may result in a much larger I/O request being made than a single 8KB page, but much better performance because in effect, SQL is pulling more data from each I/O operation.

In this episode (and the next one)…

Our focus in this post and the next one is similar to part 3, in that we are now going to do some real work and some of it will involve the command line. Therefore also like part 3, if you are one of those project manager types who utilise the wussy “I’m business, not technical” excuse, I want you to persist and try this stuff out. Given that I wrote this series with you in mind, put that damn iPad down, get out your laptop and reload this article! You can try all of the steps below out on your PC while you are reading this.

Now for the tech types reading this, on account of my intention to “demystify” SharePoint performance, I will be more verbose than what you guys need. But consider it this way – I am doing you guys a favour, because next time your PM or BA’s eyes start to glaze over when you start explaining performance and capacity planning to them, you can point them to this series and tell them that there is no excuse.

This article is going to cover two areas. First up let’s look at what we can do with Windows inbuilt Performance Monitor tool in terms of monitoring Latency and IOPS in particular. Next we will look at a popular tool for stress testing disk infrastructure that gives us visibility into MBPS.

The basics: Performance Monitor 101

Just in case you have never done it before, type in PERFMON on any Windows box at the start button or the command line (by the way, I am assuming Windows 7 or Windows 2008 Server here).

image

If you did that, then you are looking at the classic tool used to understand how a PC or server is performing. Looking at the top left of the resultant window, you should see several options listed under “Performance”. Click on “Performance Monitor” and watch the magic. Congratulations… you now know how to measure CPU as that is the default performance counter displayed.

image

Performance monitor can easily be used to take a look at disk IOPS and latency. Right click on the graph and from the menu choose Add Counters… This will provide you with a long list of “performance objects” (a fancy word for a logical grouping of performance counters)

image

From the list of performance objects, scroll up and find “LogicalDisk”. Move your cursor to the arrow to the right of the LogicalDisk counters and click on it. You should see a list of disk related performance counters appear as shown below.

image   image

Note: You could have chosen the performance object called PhysicalDisk instead of LogicalDisk. The difference between them is that the PhysicalDisk counters only consider each hard drive, not the way it is partitioned, whereas the LogicalDisk performance object monitors the logical partitions of a disk. As a general rule (for the non techy types reading this), go with LogicalDisk.

Right then… currently, all of the possible performance counters for LogicalDisk are selected, but for now we are only interested in latency and IOPS, which are represented by four counters:

Latency:

  • Avg. Disk sec/Read – Measures the average time, in seconds, of a read of data from the disk (therefore 5ms will be shown as 0.005)
  • Avg. Disk sec/Write – Measures the average time, in seconds, of a write of data to the disk

MS Technet Note: Numbers also vary across different storage configurations (SAN cache size/utilization can impact this greatly)

IOPS:

  • Disk Reads/sec – The rate of read operations on the disk per second
  • Disk Writes/sec – The rate of write operations on the disk per second

MS Technet Note: This number varies based on the size of I/O’s issued. Practical limit of 100-140/sec per disk spindle, however consult with hardware vendor for more accurate estimation.

Go ahead and select these four counters (use the Ctrl key and click each one to select more than one counter). Now we have to choose which disk or partition that we want to monitor. Below where you chose the performance counters, you will see a label with the suitably unclear title of “Instances of selected object” (I have highlighted it below). From here, choose the hard drive or partition you are interested in. Finally, click the “Add” button at the very bottom and you should see your selected counters listed in the “Added counters” window.

image   image

Click the OK button and you should now see these counters doing their thing. Each performance counter you added is listed below the graph, showing the performance data collected in real time. The display shows a time period of 100 seconds and is refreshed each second. Also, a neat feature that some people don’t know about is to click on one of the counters and then hold down Ctrl and type the letter “H”. This is the shortcut key for highlighting the selected counter, and the currently selected counter should now be black. Additionally, you should be able to use the up and down arrow keys to cycle through the counters and highlight each one.

image

At this point, try copying some files or open some applications and watch the effect. You should see a spike in disk related activity reflected in the IOPS and latency counters above. There you go business analysts, you officially have monitored disk performance! Wasn’t so hard was it?

Now that we are monitoring some interesting counters, how about we really give the disk something to chew on! 🙂

Upping the ante with SQLIO

SQLIO is an old tool nowadays, but still highly relevant and extremely useful. Despite being named SQLIO, it actually has very little to do with SQL Server! It was provided by Microsoft to help determine the I/O capacity that a server can handle. SQLIO allows you to test a combination of I/O sizes for read/write operations, both sequentially and randomly. Thus, it is useful for stress testing the disk infrastructure for any IO intensive application. Now be warned… you can absolutely smash your disk infrastructure with this tool, so don’t go running this in production without some sort of official clearance. Furthermore, if you want to use SQLIO to test your SAN, be sure to consider the other servers and applications that might be using it. There is potential to adversely affect them.

You can download SQLIO from Microsoft here. It will run on any recent Windows OS, so you can try it on your own PC (now you know why I told you to put your iPad away earlier). Installing SQLIO is very simple: just run SQLIO.MSI and it will install by default into the C:\Program Files (x86)\SQLIO folder.

Note: If you want a great tutorial on installing and using SQLIO, look no further than MCM Brent Ozar’s 2009 article entitled SQLIO Tutorial: How to Test Disk Performance.

SQLIO works by reading from and writing to one or more test files, so the first thing we need to do with SQLIO is to set up a configuration file that specifies the location and size of these test files. The configuration file is called PARAM.TXT and is found in the installation folder. Each line of the configuration file represents a test file, its size and a couple of other parameters. The options on each line of the param.txt file are as follows:

  • <Path to test file> Full path and name of the test file to be used.
  • <Number of threads (per test file)>
  • <Mask > Set to 0x0
  • <Size of test file in MB> Ideally, this should be large enough so that the test file will be larger than any cache resident on the SAN (or RAID controller).

Of these 4 parameters, only the first one (the location of the file) and the last one (the size of the file) matter for now. Below is a sample param.txt that tests a 20GB file on the E:\ drive.

image
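In case the image above does not come through, the file itself is just a single line of text. Based on the parameter descriptions above and the 20000 MB file size that SQLIO reports later in this post, it would look something like this (the path and size are whatever you decided on, of course):

e:\testfile.dat 1 0x0 20000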

The next step is to run a quick SQLIO test using sequential writes to create the test file. We are going to use the command line to do this (although someone has written a GUI for the tool). So open a command prompt, change to the installation directory for SQLIO and type the command below (we will save a detailed explanation of the parameters for later).

sqlio -kW -s10 -fsequential -o8 -b8 -LS -Fparam.txt

This command will create the file and run a 10 second test. The output will look something like what I have pasted below:

sqlio v1.5.SG

using system counter for latency timings, 2241035 counts per second

parameter file used: param.txt

     file e:\testfile.dat with 1 thread (0) using mask 0x0 (0)

1 thread writing for 10 secs to file e:\testfile.dat

     using 8KB sequential IOs

     enabling multiple I/Os per thread with 8 outstanding

size of file e:\testfile.dat needs to be: 20971520000 bytes

current file size:      104857600 bytes

need to expand by:      20866662400 bytes

expanding e:\testfile.dat …

SQLIO will stop here for a while, while your PC chugs away creating the 20GB test file. Once completed, it will run our quick 10 second test, but you can ignore the rest of the output because this test is of no consequence.

Running a real test

The previous command was just the entrée. We are not interested in the resulting data because the point of the exercise was to create the test file. Now it is time for the main course. Try this command. It will spend 2 minutes running random IO writes to the 20GB test file using a size of 8KB for each write.

sqlio -kW -b8 -frandom -s120 -BH -LS -Fparam.txt

Below is the output that summarises the configuration specified by the above command:

sqlio v1.5.SG

using system counter for latency timings, 2241035 counts per second

1 thread writing for 120 secs to file e:\TestFile.dat

using 8KB random IOs

buffering set to use hardware disk cache (but not file cache)

using current size: 20000 MB for file: e:\TestFile.dat

initialization done

For the next two minutes SQLIO will chug away, hammering the disk with writes. Once the test has completed, SQLIO will report its findings. You will see IOPS, MBPS and a report of average/max/min latency. On top of this, a histogram showing the distribution of latency is provided. This histogram gives context to the “average latency” figure because it shows the shape of the latency that occurred throughout the test. I graphed the distribution in Excel below the SQLIO results:

CUMULATIVE DATA:

throughput metrics:

IOs/sec:   225.80

MBs/sec:     1.76

latency metrics:

Min_Latency(ms): 0

Avg_Latency(ms): 3

Max_Latency(ms): 111

histogram:

ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+

%:  4  6  6 31 23 15  5  3  2  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0

image

Running the numbers…

Now, before we get into a more detailed test, let’s examine some of the SQLIO parameters:

  • -k specifies whether to perform a read or write test (–kW for write and –kR for read)
  • -s specifies how long to run the test for. In the example above it ran for 2 minutes (120 seconds)
  • -f specifies whether to run a random or sequential IO operation (-frandom)
  • -b specifies the size of the IO operations (in the example above 8KB)
  • -t specifies the number of threads to use. A multi-cpu server should be able to utilise more threads than you have processors. If your storage can handle it, we can increase the number of threads and see what latency arises as a result.
  • -o specifies the number of outstanding requests. This simulates a sudden spike in load and gives an indication of how fast IO requests are being serviced. If you keep adding outstanding requests, latency will start to increase as the number of IO requests outstrips the disk’s ability to service them.
  • -LS means to capture the disk latency information. If you do not specify this you will not get any latency results

Okay, so how about we see what difference a queue of IO requests makes. Below is a SQLIO command with the addition of the –o parameter. Let’s see what a queue of 4 outstanding requests does and compare the histogram output…

sqlio -kW -b8 -frandom -s120 -o4 -BH -LS -Fparam.txt

And the result? Much more latency than our first example above but no real increase in IOPS or MBPS. Clearly we are already at the limit of what my laptop can handle (I stripped the hyperbole and pasted the counters only).

IOs/sec:   221.73

MBs/sec:     1.73

Min_Latency(ms): 0

Avg_Latency(ms): 17

Max_Latency(ms): 187

 

image

Now since I only changed one parameter and such a difference was seen, most people will use SQLIO with a batch file to test different parameters. For example, if you were to paste the commands below into a batch file, you would be running write tests using 16KB, 64KB and 128KB sizes.

sqlio -kW -b16 -frandom -s120 -BH -LS -Fparam.txt

sqlio -kW -b64 -frandom -s120 -BH -LS -Fparam.txt

sqlio -kW -b128 -frandom -s120 -BH -LS -Fparam.txt

For what it’s worth, here are the results for each of the above tests (including the 8KB one we started with), showing the relationship of IOPS, MBPS and latency. As predicted by our exploration of the relationship between request size, IOPS and MBPS in part 6, latency was smallest with the 8KB option.

Test          IOs/sec   MBs/sec   Avg_Latency(ms)
8KB write     225.80    1.76      3
16KB write    220.39    3.44      4
64KB write    192.85    12.05     4
128KB write   176.30    22.02     5

image

Now one quick note: if you want to play with the –t parameter and add more threads, you will have to reference the test file directly and not refer to the parameters file. This is because one of the settings in the param.txt file is the number of threads for each file. No matter what you put in at the command line, it will be overwritten by what is specified in param.txt. Thus the command below would only run a single thread despite 8 threads being specified via the –t parameter.

sqlio -kW -b64 -frandom -s120 -t8 -o1 -BH -LS -Fparam.txt

sqlio v1.5.SG

using system counter for latency timings, 2241035 counts per second

parameter file used: param.txt

file c:\testfile.dat with 1 thread (0) using mask 0x0 (0)

 

To get around this issue, drop the –F parameter and refer to the test file directly as shown below:

sqlio -kW -b64 -frandom -s120 -t8 -o1 -BH -LS e:\testfile.dat

sqlio v1.5.SG

using system counter for latency timings, 2241035 counts per second

8 threads writing for 120 secs to file e:\testfile.dat

 

Conclusion (and coming up next)…

Phew! Okay, so apart from possibly whetting your appetite for smashing disk infrastructure, you might have also come to the realisation that there are many parameters to test in various combinations. In this entire article, I have assumed random writes to the disk, but what about sequential writes? For that matter, what about reads? What about multiple threads and more outstanding requests? What about longer tests or different sized test files?

These are all important questions and I will answer them… in the next post or two. This one is getting a little too long and I have plenty more to cover in this area.

So have a play with the parameters of SQLIO on your own hardware and in the next post, we will continue looking at SQLIO, plus some great work people have done to make your life much easier using it. I want to also return to PERFMON to show you the relationship between the PhysicalDisk and LogicalDisk counters and what SQLIO reports. Then we will examine two other tools, including one that is lesser known, but a very powerful way to measure disk performance. (That one will redeem me with the tech guys who will have no doubt found this article to be too light on. 🙂)

Subsequent to that, we hark way back to part 1 and return to a lead indicator point of view of disk IO performance, and look at how you can nail the ass off your SAN vendor to ensure they do all the due diligence necessary so that your disk infrastructure will perform well.

Until then, thanks for reading

Paul Culmsee


www.sevensigma.com.au



Demystifying SharePoint Performance Management Part 6 – The unholy trinity of Latency, IOPS and MBPS

Hi all

Welcome to part 6 of my series on making SharePoint performance management that little bit more digestible. To recap where we have been, I introduced the series by comparing lead versus lag indicators before launching into an examination of Requests Per Second (RPS) as a performance indicator. I spent three posts on RPS and then in the last post, we turned our attention to the notion of latency. We watched a Wiggles video and then looked at all of the interacting components that work together just to load a SharePoint home page. I spent some time explaining that some forms of latency cannot be reduced because of the laws of physics, but other forms of latency are man made. This is when any one of the interacting components is sub-optimally configured and therefore introduces unnecessary latency into the picture. I then asserted that disk latency is one of the most common areas that is ripe for sub-optimal configuration. I finished that post by looking at how a rotational disk works and the strategies employed to mitigate latency (cache, RAID, SANs etc.).

Now on the note of cache, RAID and SANs, Robert Bogue (who I mentioned in part 1) has also just published an article on this topic area called Computer Hard Disk Performance – From the Ground Up. You should consider Robert’s article part 5.5 of this series of posts because it expands on what I introduced in the last post and also spans a couple of the things I want to talk about in this one (and goes beyond it too). It is an excellent coverage of many aspects of disk latency and I highly recommend you check it out.

Right! In this post, we will look more closely at latency and understand its relationship with two other commonly cited disk performance measures: IOPS and MBPS. To do so, let’s go shopping!

Why groceries help to explain disk performance

image

Most people dislike having to wait in line for a check-out at a supermarket and supermarkets know this. So they always try and balance the number of open check-out counters so that they can scale when things are busy, but not pay operators to stand around when it’s quiet. Accordingly, it is common to walk into a store when it’s quiet and find only one or two check-out counters open, even if the supermarket has a dozen or more of them.

The trend in Australian supermarkets nowadays is to have some modified check-out counters that are labelled as “express”, which you can only use if you are buying 15 items or less. While the notion of express check-outs has been around forever, the more recent trend is to modify the design of express check-out counters to have very limited counter space and no moving roller that pushes your goods toward the operator. This discourages people with a fully-loaded trolley/cart from using the express lane because there is simply not enough room to unload the goods, have them scanned and put them back in the trolley. Therefore, many more shoppers can go through express counters than regular counters because they all have smaller loads.

This in turn frees up the “regular” check-out counters for shoppers with a large amount of goods. Not only do they have a nice long conveyor belt, with plenty of room for shoppers to unload all of their goods onto and which rolls them to the operator, but often there will be another operator who puts the goods into bags for you as well. Essentially this counter is optimised for people who have a lot of goods.

Now if you were to measure the “performance” of express lanes versus regular lanes, I bet you would see two trends.

  • Express lanes would have more shoppers go through them per hour, but less goods overall
  • Regular lanes would have more goods go through them per hour, but less shoppers overall

With that in mind, let’s now delve back into the world of disk IO and see if the trend holds true there as well.

Disk latency and IOPS

In the last post, I specifically focused on disk latency by pointing out that most of the latency in a rotational hard drive comes from rotation time and seek time. Rotation time is the time taken for the drive to rotate the disk platter to the data being requested, and seek time is how long it takes for the hard drive’s read/write head to then be positioned over that data. Depending on how far the rotation and head have to move, latency can vary. Closely related to disk latency is the notion of IO operations per second or “IOPS”. IOPS refers to the maximum number of reads and writes that can be performed on a disk in any given second. If we think about our supermarket metaphor, IOPS is equivalent to the number of shoppers that go through a check-out.

The math behind IOPS and how latency affects it is relatively straightforward. Let’s assume a fixed latency for each IO operation for a moment. If for example, your disk has a large latency… say 25 milliseconds between each IO operation, then you would roughly have 40 IOPS. This is because 1 second = 1000 milliseconds. Divide 1000 by 25 and you get 40. Conversely, if you have 5 milliseconds latency, you would get 200 IOPS (1000 / 5 = 200).
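As a throwaway PowerShell illustration of that arithmetic (any latency figure will do; these two are just the ones from the paragraph above):

foreach ($latencyMs in 25, 5) {
    # 1000 milliseconds in a second, divided by the time each IO takes
    "{0} ms per IO -> roughly {1} IOPS" -f $latencyMs, [math]::Round(1000 / $latencyMs)
}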

Now if you want to see a more detailed examination of IOPS/ latency and the maths behind it, take a look at an excellent post by Ian Atkin. Below I have listed the disk latency and IOPS figures he posted for different speed disks. Note that a 15k RPM disk came in at around 175-210 IOPS which suggests a typical latency average of between 4.7 and 5.7 milliseconds. (1000/175 = 5.7 and 1000/210 = 4.7). Note: Ian’s article explains in depth the maths behind the average calculation in this section of his post.

image

The big trolley theory of IOPS…

While that math is convenient, the real world is always different from the theoretical reality I painted above. In the world of shopping, imagine if someone with one or two trolleys full of goods, like the picture below, decided to use the express check-out. It would mean that all of the other shoppers have to get annoyed and wait around for this shopper’s goods to be scanned, bagged and put back into the trolley. The net result is a reduced number of shoppers going through the check-out.

image

While the inefficiencies of a supermarket are easy for most people to visualise, disk infrastructure is less so. So while the size of our trolley has an impact on how many people come through a check-out, in the disk world, the size of the IO request has precisely the same effect. To demonstrate, I ran a basic test using a utility called SQLIO (which I will properly introduce you to in part 7) on one of my virtual machines. Below are the results of writing data randomly to a 500GB disk. In the first test we wrote to the disk using 64KB writes and in the second test we used 4KB writes. The results are below:

Size of Write   IOPS Result
64KB            279
4KB             572

Clearly, writing 4KB of data over time resulted in a much higher IOPS than when using 64KB of data. But just because there is a higher IOPS for the 4KB write, do you think that is better performance?

Disk latency and MBPS

So far the discussion has been very IOPS focussed. It is now time to rectify this. In terms of the SQLIO test I performed above, there was one other performance result I omitted to show you – the Megabytes per second (MBPS) of each test. I will now add it to the table below:

Size of Write   IOPS Result   MBPS Result
64KB            279           17.5
4KB             572           2.25

Interesting eh? This additional performance metric paints a completely different picture. In terms of actual data transferred, the 4KB option did only 2.25 megabytes per second whereas the 64KB transferred almost 8 times that amount! Thus, if you were judging performance based on how much data has been transferred, then the 4KB option has been an epic fail. Imagine the response of 500 SharePoint users, loading the latest 30 megabyte annual report from a document library if SharePoint used 4KB reads … Ouch!

So the obvious question is why did a high IOPS equate to a low MBPS?

The answer is latency again (yup – it always comes back to latency). From the time the disk was given the request to the time it completed, writing 4KB simply doesn’t take as long as writing 64KB does. Therefore more IOPS take place with smaller writes. Add to that the latency from disk rotation and seek time per IO operation, and you start to see why there is such a difference. Eric Slack at Storage Switzerland explains with this simple example:

As an illustration, let’s look at two ways a storage system can handle 7.5GB of data. The first is an application that requires reading ten 750MB files, which may take 100 seconds, meaning the transfer rate is 75MB/s and consumes 10 IOPS. The second application requires reading ten thousand 750KB byte files, the same amount of data, but consumes 10,000 IOPS. Given the fact that a typical disk drive provides less than 200 IOPS, the reads from the second application probably won’t get done in the same 100 seconds that the first application did. This is an example of how different ‘workloads’ can require significantly different performance, while using the same capacity of storage.
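Another way to see the same thing is that throughput is simply IOPS multiplied by the IO size. Running the numbers from my random write tests earlier bears this out; the figures below are the ones from the table above, and the small differences from the observed 17.5 and 2.25 MBPS are just rounding:

# MBPS is roughly IOPS x IO size. Figures from the random write tests earlier in this post.
$tests = @(
    @{ SizeKB = 64; IOPS = 279 },   # observed 17.5 MBPS
    @{ SizeKB = 4;  IOPS = 572 }    # observed 2.25 MBPS
)
foreach ($t in $tests) {
    "{0}KB writes: {1} IOPS x {0}KB per IO = {2:N1} MBPS" -f $t.SizeKB, $t.IOPS, ($t.IOPS * $t.SizeKB / 1024)
}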

Now at this point if I haven’t completely lost you, it should become clear that each of the unholy trinity of latency, IOPS and MBPS should not be judged alone. For example, reporting on IOPS without having some idea of the nature of the IO could seriously mislead. To show you just how much, consider the next example…

Sequential vs. Random IO

Now while we are talking about the IO characteristics of applications, two really important points that I have neglected to mention so far are the range of latency and the impact of sequential IO.

The latency math I did above was deliberately simplified. Seek and rotation time actually fall across a range of values, because sometimes the disk does not have to rotate the spindle or move the head far. The result is a much reduced seek latency and accordingly, increased IOPS and MBPS. Nevertheless, the IO is still considered random.

Taking that one step further, often we are dealing with large sections of contiguous space on the hard disk. Therefore latency is reduced further because there is virtually no seek time involved. This is known as sequential access. Just to show you how much of a difference sequential access makes, I re-ran the two tests above, but this time writing to sequential areas of the disk and not random. With the reduced seek and rotation time, the difference in IOPS and MBPS is significant.

Size of Write   IOPS Result   MBPS Result
64KB            2095          131
4KB             4152          16

The IOPS and subsequent MBPS improved significantly from the previous test, to the tune of some 750%. Nevertheless, the relationship between the size of the request, IOPS and MBPS still holds true. The smaller the IO request being read or written, the more IOPS can be sustained, but the less MBPS throughput can be achieved. The reverse then holds true with larger IO requests.

One conclusion that we can draw from this is that specifying IOPS or MBPS alone has the potential to really distort reality if one does not understand the nature of the IO request in terms of its characteristics. For example, let’s say that you are told your disk infrastructure has to support 5000 IOPS. If you assumed a 4KB IO size that is accessed sequentially, then far fewer disks would be required to achieve the result compared to a 64KB IO accessed randomly. In the 64KB case, you would need many disks in an array configuration.
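To make that concrete, here is a very rough spindle-count sketch using the 15k RPM per-disk figures from Ian Atkin’s table earlier in this post. It deliberately ignores RAID write penalties, cache and all the other things your SAN vendor will rightly point out, so treat it as an illustration only:

$targetIOPS = 5000
# 15k RPM spindles were quoted earlier at roughly 175-210 random IOPS each
"Best case : {0} disks" -f [math]::Ceiling($targetIOPS / 210)    # about 24 spindles
"Worst case: {0} disks" -f [math]::Ceiling($targetIOPS / 175)    # about 29 spindles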

SQL IO Characteristics

So now we get to the million dollar question. What sort of IO characteristics does SQL and SharePoint have?

I will answer this by again quoting from Ian Atkin’s brilliant “Getting the Hang of IOPS” article. Ian makes a really important point that is relevant to SQL and SharePoint in his article which I quote below:

The problem with databases is that database I/O is unlikely to be sequential in nature. One query could ask for some data at the top of a table, and the next query could request data from 100,000 rows down. In fact, consecutive queries might even be for different databases. If we were to look at the disk level whilst such queries are in action, what we’d see is the head zipping back and forth like mad -apparently moving at random as it tries to read and write data in response to the incoming I/O requests.

In the database scenario, the time it takes for each small I/O request to be serviced is dominated by the time it takes the disk heads to travel to the target location and pick up the data. That is to say, the disk’s response time will now dominate our performance.

Okay, so we know that SQL IO is likely to be random in nature. But what about the typical IO size?

Part of the answer to this question can be found in an appropriately titled article called Understanding Pages and Extents. It is appropriate because, as far as SQL Server database files and indexes are concerned, the fundamental unit of data storage in SQL Server is an 8KB page. The important point for our discussion is that many disk I/O read and write operations are performed at the page level. Thus, one might assume that 8KB should be the size assumed when working with IOPS calculations, because it is possible for SQL to write 8KB to disk at a time.

Unfortunately though, this is not quite correct for a number of reasons. Firstly, eight contiguous 8KB pages are grouped into something called an extent. Given that an extent is a set of 8 pages, the size of an extent is 64KB. SQL Server generally allocates space in a database on a per-extent basis and performs many reads across extents (64KB). Secondly, SQL Server also has a read-ahead algorithm, which means SQL will try and proactively retrieve data pages that are going to be used in the immediate future. A read-ahead is typically from 1 to 128 pages for most editions, which translates to between 8KB and 1024KB. (For the record, there is a huge amount of conflicting information online about SQL IO characteristics. Bob Dorr’s highly regarded SQL Server 2000 I/O Basics article is the place to go for more gory detail if you find this stuff interesting.)

A read-ahead interlude…

Before we get into SharePoint disk characteristics, it is worthwhile mentioning a great article by Linchi Shea called Performance Impact: Some Data Points on Read-Ahead. Linchi did an experiment by disabling read-ahead behaviour in SQL Server and measuring the performance of a query over 2 million rows. With read-ahead enabled, it took 80 seconds to complete. Without read-ahead it took 210 seconds. The key difference was the size of the IO requests. Without read-ahead, the reads were all 8KB as per the page size. With read-ahead, it was over 350KB per read. Linchi makes this conclusion:

Clearly, with read-ahead, SQL Server was able to take advantage of large sized I/Os (e.g. ~350KB per read). Large-sized I/Os are generally much more efficient than smaller-sized I/Os, especially when you actually need all the data read from the storage as was the case with the test query. From the table above, it’s evident that the read throughput was significantly higher when read-ahead was enabled than it was when read-ahead was disabled. In other words, without read-ahead, SQL Server was not pushing the storage I/O subsystem hard enough, contributing to a significantly longer query elapsed time.

So for our purposes, let’s accept that there will be a range of IO sizes for reads and writes to databases between 8KB and 1024KB. For disk IO performance testing purposes, let’s assume that much of this is across the extent boundaries of 64KB. Based on our discussion of latency and MBPS, where the larger the IO being worked with, the lower the IOPS, we can now get a better sense of just how much disk might need to be put into an array to achieve a particular IOPS target. As we saw with the examples earlier in this post, 64KB IO sizes result in more latency and lower IOPS. Therefore SharePoint components requiring a lot of IOPS may need some pretty serious disk infrastructure.

SharePoint IO Characteristics

This brings us to our final point for this post. We need to understand which SharePoint components are IO intensive. The best place to start is page 29 of Microsoft’s capacity planning guide, as it supplies a table listing the general performance requirements of SharePoint components. A similar table exists on page 217 of the Planning guide for server farms and environments for Microsoft SharePoint Server 2010. We will finish this post with a modified table that shows all the SharePoint components listed with medium to high IOPS requirements from the capacity planning guide, along with some of the comments from the server farm planning guide. This gives us some direction as to the SharePoint components that should be given particular focus in any sort of planning. Unfortunately, IOPS requirements are inconsistently written about in both documents. 🙁

  • SharePoint Foundation Service (SQL Server IOPS cost: XXX) – The core SharePoint service for content collaboration. Almost all of the IOPS occurs in SharePoint content databases. IOPS requirements for content databases vary significantly based on how your environment is being used, and how much disk space and how many servers you have. Microsoft recommends that you compare the predicted workload in your environment to one of the solutions that they have tested. I will be covering this in part 8.

  • Logging Service (SQL Server IOPS cost: XXX) – The service that records usage and health indicators for monitoring purposes. The Usage database can grow very quickly and require significant IOPS. Use one of the following formulas to estimate the amount of IOPS required: 115 × page hits/second, or 5 × HTTP requests.

  • SharePoint Search Service (SQL Server IOPS cost: XXX) – The shared service application that provides indexing and querying capabilities. There is a dedicated document that, among other things, covers IOPS requirements. For the Crawl database, search requires from 3,500 to 7,000 IOPS. For the Property database, search requires 2,000 IOPS.

  • User Profile Service (SQL Server IOPS cost: XXX) – The service that powers the social scenarios in SharePoint Server 2010 and enables My Sites, Tagging, Notes, Profile sync with directories and other social capabilities. No mention of IOPS is made in either of the planning guides.

  • Web Analytics Service (SQL Server IOPS cost: XXX) – The service that aggregates and stores statistics on the usage characteristics of the farm. The planning guide suggests readers consult a dedicated planning guide for web analytics, but unfortunately no mention of IOPS is made, let alone a recommendation.

  • Project Server Service (SQL Server IOPS cost: XXX) – The service that enables all the Microsoft Project Server 2010 planning and tracking capabilities in addition to SharePoint Server 2010. No mention of IOPS is made in either of the planning guides.

  • PowerPivot Service (SQL Server IOPS cost: XX) – The service to display PowerPivot enabled Excel worksheets directly from the browser. No mention of IOPS is made in either of the planning guides.

(In case it is not obvious, XX in the rightmost column indicates a medium IOPS cost on the resource and XXX indicates a high IOPS cost on the resource.)
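
To put the Logging Service formula in the table above into perspective, here is a quick hypothetical calculation (the page hit and request rates are invented purely for illustration):

  • A farm averaging 10 page hits per second would need roughly 115 × 10 = 1,150 IOPS for the Usage database
  • Using the alternative formula, 50 HTTP requests per second suggests roughly 5 × 50 = 250 IOPS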

Conclusion (and coming up next)

Whew! I have to say, that was a fairly big post, but I think we have broken the back of latency, IOPS and MBPS. In the next post, we will put all of this theory to the test by looking at the performance counters that allow us to measure it all, as well as play with a couple of very useful utilities that allow us to simulate different scenarios. Subsequent to that, we will look at these measures from a lead indicator perspective and then examine some of Microsoft’s results from their testing.

Until then, thanks very much for reading. As always, comments are greatly appreciated.

Paul Culmsee

www.hereticsguidebooks.com



Demystifying SharePoint Performance Management Part 5 – So what is latency anyway?

Hi all

Welcome to part 5 in my attempt to make SharePoint performance management a little more accessible. Now that we have dealt with the world of request per second in parts two, three and four, we will focus our attention somewhere different for a post or three.

To set the scene, we are going to take a bit of an end to end look at what it takes to load a SharePoint page. I suspect some readers do not have the full picture of just how many components interact to load the SharePoint home page. Things are much more complex in reality than the typical architectural view that adorns SharePoint blogs. A typical SharePoint diagram will list the servers and their roles, but what about all…

  • the network gear like routers, switches, reverse proxies and firewalls that are part of the mix?
  • the VMware or Hyper-V virtual hosts that provide the virtualised servers? And
  • the storage area network and its associated paraphernalia that these virtual servers make use of for disk infrastructure?

Make no mistake people, configurations these days are hugely complex and have many moving parts. If any of the various components listed above were to fail or become a bottleneck, the performance of the entire system suffers. Therefore, we need assurance that each component has been optimised so that the system as a whole performs well.

This brings us onto the topic of latency. If you are not sure what latency is, I can guarantee that you actually do know. You see, if you have ever experienced a jittery skype call, or your pornography is slow to load, or you have watched a roving reporter respond several seconds after being asked a question from the studio, you are experiencing latency.

Now, the important point to make straight up is that latency is unavoidable because of the laws of physics. Take the example of one of the rovers that NASA sent to Mars. All radio signals to Mars travel at the speed of light (which despite Star Trek’s best efforts to persuade us otherwise, is the absolute speed limit of the universe). The speed of light is around 300,000 kilometres per second and the distance to Mars is currently around 150 million kilometres from Earth. So doing some basic math, we find that it takes a little over 8 minutes for a signal to get from Earth to Mars.

  • 150,000,000 / 300,000 = 500 seconds
  • 500 / 60 = 8.3 minutes

In this example of latency, no matter what  happens, there will always be around 8 minutes of latency between the time an instruction is sent to a rover, to the time it receives and acts on it. Unless Einstein was wrong, this isn’t about to change in a hurry either.

A “lean” view of latency…

Latency is a concept that extends beyond the forces of nature. Let me give you another form of latency that I am sure you have experienced, using Microsoft as the straw man. Let’s say you have a problem with SharePoint and you log a call with Microsoft or your support provider. You call the technical support line and after twiddling your thumbs in the telephone queue for an eternity, you get an inexperienced level 1 tech, who doesn’t understand your problem at all and is hell bent on closing your call anyway because someone higher up in the organisation actually believed that call-time is an indicator of happy customers. You repeat yourself each and every time as your call is slowly routed up the tech support hierarchy. Finally, by the time you get to level 3 or 4, you finally get a good tech who gives you the quick answer you were looking for. The problem is that three weeks have passed to get to this point.

This is also a form of latency. But unlike the first example, it was not the laws of nature this time, but man made processes that caused the wasted time. I will call it organisational latency. Addressing this form of latency is a multi billion dollar industry, and keeps an army of organisational/process improvement consultants busy, trying to reduce wastage and improve customer outcomes (now you know what Lean is all about if you hear people talking about it).

So, returning to the SharePoint context – we have a lot of moving parts. We know we cannot alter the laws of physics, but how do we know whether all of the various components are working to their optimum level? Is there any man-made latency that we could reduce or eliminate?

Oh, yes, indeedie there is… and to put some context  to it, let’s utilise the musical genius that is the Wiggles. I found that their rendition of the old folk song “Dem bones” serves my purpose nicely.

 

The Wiggles, teaching us about SharePoint latency 🙂

When you perform the seemingly benign task of requesting a page with your browser, an amazing number of things have to happen. The browser forms a HTTP request and then passes this to the TCP/IP stack on your PC, which takes the HTTP request and breaks it up into IP packets. These packets are passed to your network card driver, which turns the packets into Ethernet frames and sends them over the wire. Each network device (switch, router, etc.) has to process each frame or IP packet and work out where to forward it. Eventually the request finds its way to the destination server where the Ethernet frames are stripped, the IP packets are reassembled into the original HTTP request, passed to IIS, and SharePoint then acts on it.

Now all I described above was the task of getting a HTTP request from a browser to a server. To see the entire picture, let’s all sing along with the Wiggles shall we? We will assume a two server deployment, utilising a Hyper-V based virtual web front end SharePoint server and a physical SQL Server. Both servers use a Storage Area Network (SAN) for disk. Cue the melody from “Dem Bones”…

  • Your PC connects to a distribution switch
  • The distribution switch is connected to the core switch
  • The core switch connects to the HyperV host
  • The HyperV host connects to the virtual Web Front End Virtual Machine

… okay so we have managed to get from our browser to the SharePoint web front end but at this point, the web front end hasn’t really done anything yet.  In terms of latency, we had to get through the switches, as well as the virtualisation infrastructure, to the virtual SharePoint web front end box. The switches had very little latency at all – probably around 1-2 microseconds (which translates to about 0.001 to 0.002 milliseconds) for a network packet to go in one port and out the other. The virtualisation infrastructure also introduced some latency, because there is overhead in running a virtual machine within a physical machine. However, assuming it is well configured and that there aren’t too many virtual machines competing for physical resources like CPU and memory, then that latency is fairly negligible.

Now the virtual web front end server needs to actually deal with the request from your PC. This involves pulling data from the disk infrastructure, so back to the Wiggles we go…

  • the Web Front End Virtual Machine connects to the HyperV host
  • The HyperV host connects to the SAN Switch
  • The SAN Switch connects to the Storage Array
  • The Storage Array connects to the Web Front End disk
  • The Web Front End disk returns data to the SAN Switch
  • The SAN switch returns data to the HyperV host
  • The HyperV host returns data to the Web Front End Virtual Machine

…at this point, the web front end server has retrieved any data it needs to from the disk subsystem. There was definitely latency here. The SAN switches have a similar latency to network switches which is negligible, but the physical disks on the SAN are another story (which we will get to soon). But wait a second – that just loads the stuff the web front end server stores or caches locally, as well as writing to the IIS and SharePoint logs. What about all those sexy web parts you have on the front page that aggregate the latest news feed? That stuff needs to pull information from the SharePoint content database on the SQL Server. So let’s continue, now incorporating the connection between the virtual web front end and SQL Server (Remember, I am assuming the SQL box is not virtualised).

  • The Web Front End Virtual Machine connects to the SQL box (via the network on TCP port 1433)
  • The SQL Box connects to the SAN Switch
  • The SAN Switch connects to the Storage Array
  • The Storage Array connects to the SQL disk
  • The SQL disk returns data to the SAN Switch
  • The SAN switch connects to the SQL box
  • The SQL Box connects to the Web Front End Virtual Machine (via the network on TCP port 1433)
  • The Web Front End Server returns the page to your PC (via the network on TCP port 80)

Now at this point, non tech oriented readers might be thinking, “Bloody hell! I didn’t realise there were that many interactions.” For you guys… now you know why tech guys are the way they are. Tech guys reading this would know full well that I glossed over many things. For example, I did not include the authentication process in the sequence above, nor did I describe important virtualisation aspects such as VM memory compression. On top of that I glossed big-time over the full gamut of SAN I/O paths.

There is a form of man-made latency that can occur in any of these steps outlined above as a result of the complexity. It is very easy to overlook an important aspect, or to misconfigure something or to assume the default configuration is optimal. In my consulting experience I have seen sub optimal configuration in all of the above touchpoints, but out of all of them, there is one area that is far more likely to have latency issues than any of the other areas: The disk infrastructure.

We will round out this post by taking a fairly high level view at disk infrastructure and why it is latency prone.

Understanding disk latency

Below is a Wikipedia picture that shows the essential components of most hard drives. This type of hard drive is really not that different from its original design in 1954. It is called a rotational hard drive and the spindle rotates, while the actuator arm moves the head to the right position to read data off the platter. As you can imagine, this happens pretty fast too. Most high end hard drives spin the platter at 15000RPM – dizzying, eh?

 

But to put disk performance in perspective, consider my previous example of a network switch with a 1-2 microsecond latency to process an Ethernet frame as it transits from one network port to another. By comparison, a modern hard drive takes a hell of a lot longer to do what it needs to do. As a simple example, just rotating the disk platter to the right position takes around 2 milliseconds (or 2000 microseconds) on average. Not only is this a staggering 2000 times slower than the network switch, it does not take into account the time it takes for the hard drive’s read/write head to then be positioned over the sector (this is called seek time and can take anywhere between 3 and 15 milliseconds).
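
In case you are wondering where the 2 millisecond figure comes from, it falls straight out of the rotation speed:

  • 15,000 RPM = 250 rotations per second, so one full rotation takes 1/250th of a second, or 4 milliseconds
  • On average the sector you want is half a rotation away, so the average rotational delay is around 2 milliseconds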

This latency clearly is problematic, and vendors compensate by utilising multiple sets of disks and liberal use of cache technology to mitigate it. Imagine putting 10 hard disks together and when data is saved, parts of it are written to each hard disk. Now you have reduced latency because each drive is handling a smaller part instead of a single drive handling it all. It is important to note that we have done nothing about laws-of-physics latency per single drive (thanks Robert Bogue for pointing that out), but throughput induced latency has been reduced by using them all. It is just like when you are out at the supermarket and there are ten check-out operators working instead of one. The wait times are much shorter because there are more check-out operators available to service the requests. (This is the essence of RAID technology and should be familiar to most readers.)

But there is still more to the latency story than disks taking time to do their thing. At the operating system level, there are various layers and drivers doing stuff. I won’t go too much into this except to suggest that if the world of Class Drivers, Port Drivers, Device Miniport Drivers and Disk Subsystems rocks your world, then Jeff Hughes has a great writeup where he describes the whole Windows disk IO system in detail.

A note on SSD

I would be remiss not to make a point about these newfangled Solid State Drives (you might have heard them mentioned as SSDs). This is a newer hard drive technology that does not employ any moving mechanical components, like platters and movable read/write heads. Solid State Drives have seriously improved performance in terms of latency, because they store the data in persistent memory. Wikipedia cites SSD latency at around 0.1 milliseconds, compared to rotational drives at around 5-10 milliseconds. The downside is that they are more expensive than traditional rotational disks. According to a May 2012 article, SSDs cost approximately US$0.65 per GB whereas traditional hard disks cost about US$0.05 per GB. Expect prices to continue to fall and for SSDs to appear in more and more solutions.
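
To make that price difference tangible, consider a hypothetical 600GB drive at those May 2012 prices: 600 × US$0.65 ≈ US$390 for the SSD versus 600 × US$0.05 = US$30 for the rotational disk – around thirteen times the cost for the same capacity.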

Then there are SANs

In terms of disk infrastructure and latency aspects, most organisations utilise a Storage Area Network (SAN) topology. I previously mentioned the idea of RAID configurations that make use of multiple disks to improve latency (among other things). SANs take the RAID idea and abstract it further, as shown below.

[Image: SAN topology diagram]

(credit for this image is Orbis solutions: http://orbissolutionsinc.com/blog/tag/storage-arrays/)

I sometimes describe SANs to people as a “fridge full of hard drives connected to multiple servers”. What the above diagram shows is that the disks are physically not connected to the servers that use them. Instead they are connected to a storage array via cables, with a switch or three in between. Each server has some disk space reserved for it on the SAN. So the result is we have one centralised high performing disk array where we can take advantage of all of the disks housed within.

But it’s important to understand here that each time data is read from or written to disk, it passes across those cables and through the switches. Like an internet connection, the SAN switch and cables not only have bandwidth limitations, but are prone to misconfiguration. Imagine 50 servers writing data at the same time. If things are not well configured, the SAN switch infrastructure might become overwhelmed like a freeway during peak hour. Direct attached storage (i.e. the hard drive or RAID array is plugged into the server directly) typically has higher bandwidth. This quote from a nice sqlteam.com article on SAN performance explains it well.

For instance, if a server is equipped with two older 1-Gbps host bus adapters (HBAs), its MBps throughput would be capped at about 200MB per second no matter how powerful the rest of the SAN is. Replacing the 1-Gbps HBAs with two newer 4-Gbps HBAs or adding more HBAs may improve the throughput, if the HBAs are indeed the throughput bottleneck. But the SAN drive throughput could also be limited by the maximum throughput of the inter-switch links in the SAN switched fabric. Further down the I/O paths, the front-side adapter ports on the disk array, the cache in the disk array, the disk controllers, and the disk spindles can all become the bottleneck limiting the MBps throughput of the SAN drive.

Conclusion and coming next…

Okay… at this point let’s take a breather. For the tech guys reading this post, none of what I covered may seem particularly earth shattering, but it was important to set the context for a deeper dive into disk latency in the next couple of posts. If you are not normally of the tech persuasion, then I hope that this post has opened your eyes a little to just how complicated deployments can be and accordingly, how hard it can sometimes be to pinpoint latency issues when they occur.

In the next post, we will take a deeper look at disk latency and its relationship to the indicators of IOPS and MBPS. We will then examine tools to measure latency and how to best use it as a lead indicator.

Until then, thanks for reading and be sure to check out my recent business book “The Heretics Guide to Best Practices”.

Paul Culmsee

www.sevensigma.com.au



Demystifying SharePoint Performance Management Part 4 – Making use of RPS

Hi all. Welcome to part 4 of a rapidly growing series of posts on trying to take some of the mystery out of SharePoint performance management. Essentially I am trying to write a sort of preamble to the existing Microsoft resources which are extremely comprehensive and full of wisdom, but suffer from being rather large and a lot to get through. To remind you, we have the 307 page “Planning guide for server farms and environments for Microsoft SharePoint Server 2010,” the 367 page “Capacity Planning for Microsoft SharePoint Server 2010” and the lesser known, but equally excellent 23 pages of “Analysing Microsoft SharePoint Products and Technologies Usage” whitepaper.

My hope is that this series establishes just enough groundwork for someone to find the aforementioned documents an easier read and get more out of them.

Now this series is starting to turn out like the “Humble Tribute to the Leave Form” series, which I never actually finished (*blush*). Basically the number of posts to complete it exceeded the time I had available to write it (and my interest shifted to other things). For this topic of performance, I originally thought this series might be 4 posts but we are now at post 4 and haven’t actually gotten off the Requests Per Second (RPS) performance counter yet.

So let’s get cracking…

Command line alert (again)

I have a tendency to have fun at the expense of IT stereotypes in my posts, and in the interests of fairness, I turned this around in part 3 and took the piss out of the “I’m business, not technical” wusses instead. You all know who you are… you tend to shun anything that involves the command line as if it was the most complex thing ever. So continuing in that vein, if I managed not to completely scare the crap out of you in the last post, you should have the excellent log parser utility installed and have created a file called LogWithSeconds.csv. If you have not, go back and read part 3. To remind you quickly, the log parser command that we used to generate the LogWithSeconds.csv file was:

logparser -i:IISW3C file:GetSeconds.txt?startdate='2011-11-15'+enddate='2011-11-15' -o:csv >LogWithSeconds.csv

The key point being that you can specify a date range for the logs you want to process.

For the rest of this article, we continue to play in the command-line playground and utilise some different logparser scripts to derive some useful information. In addition, we will utilise a bit of PowerShell, as well as check out another great free utility written by Nikander & Margriet Bruggeman (more on that later).

Also at this point I need to call out and credit the excellent work of Mike Wise. His aforementioned 2009 whitepaper called “Analysing Microsoft SharePoint Products and Technologies Usage” is the basis for what I cover here. I urge you to download and read this article as it goes into more detail on Log Parser and its uses beyond just RPS alone. Although I have based my stuff off Mike’s work, there are some differences that you will see as we progress through this article.

Distribution of RPS

The one thing that past examination of RPS can give you is a distribution of requests over time. Understanding the distribution (or shape) of RPS helps you to identify patterns of SharePoint use, such as peak or heavy usage times. To that end, the first log parser script will generate a CSV file that can be imported into a tool like Excel to chart the distribution of RPS. The log parser script below has been modified from one in Mike’s document, because he assumes you are only looking at 24 hours of log data. In my case, I assume that you might want to profile more than 24 hours (essentially the date range specified in the log parser command above).

The command to generate a per-second RPS distribution is below. The only difference between my script and the one Mike did is I added the “date” field to the SQL to account for multiple days:

logparser -i:CSV -o:CSV "select count(*) as ct,date as Date, secs,max(hh) as hh,max(mi) as mi,max(ss) as ss from e:\temp\LogWithSeconds.csv group by date,secs order by date, secs" -q >RPSDistribution.csv

This command will create a new CSV file called RPSDistribution.CSV that contains the count of requests at any given second during the specified date range. So let’s open RPSDistribution.CSV in Excel and create a chart (I assume you know how to do that). Here is what it looks like…

[Image: RPS distribution chart (Excel)]

Now I wonder if you can spot the issue with this chart? If you look closely, note that the times are not evenly spaced. This occurs because the generated file (RPSDistribution.CSV) only contains entries for the seconds during the day where there were requests. If no requests were made, then nothing was recorded. This skews the graph because if we want to see the distribution of requests, we also need to know the seconds of the day where there were zero requests. The graph you see above has effectively squeezed out all of the quiet times.

To work around this issue, I wrote the following PowerShell script. For you non-programmers, I am not going to explain all of the gory logic of this script, but just be assured that it adds entries stating zero RPS for every second of the day where there were no requests made. This will normalise the data across time and make a much more meaningful graph.

(If this is starting to hurt your brain, stick with me… paste the code below into notepad, save it in the Log parser installation folder and call it AddNulsPerSec.ps1)

# AddNulsPerSec.ps1 - pads an RPS distribution CSV with zero-request entries for every
# second of the day that had no requests, so the data is time-normalised before charting.
param([string]$inputcsv, [string]$outputcsv = "output.csv")
if (!$inputcsv) {
    write-host "The -inputcsv parameter has not been specified. Script cannot run without it";
    exit;
}
if (Test-Path -path $outputcsv) { remove-item $outputcsv }
$x = 0;                                  # the second of the day we are currently up to
$y = import-csv $inputcsv
# Header row - column order matches the rows written below
write-output "ct,date,secs,minu,ss,mi,hh" | add-content -path $outputcsv
$y | foreach {
    if ($x -gt 86399) { $x = 0 }         # new day in the input - reset the seconds counter
    $s = [int]$_.secs;
    while ($s -gt $x) {
        # emit a zero-request row for each second missing from the input file
        $d = [datetime]$_.date;
        $d = $d.AddSeconds($x)
        $ss = $d.tostring("ss")
        $mm = $d.tostring("mm")
        $hh = $d.tostring("HH")
        $minu = [int]$hh * 60 + [int]$mm
        $output = "0" + "," + $_.Date + "," + $x + "," + $minu + "," + $ss + "," + $mm + "," + $hh
        write-output $output
        $x++;
    }
    # emit the original row from the input file unchanged
    $output = $_.ct + "," + $_.Date + "," + $_.secs + "," + $_.minu + "," + $_.ss + "," + $_.mi + "," + $_.hh
    write-output $output
    $x++;
} | add-content -path $outputcsv

The above script takes two command-line parameters: inputCSV and outputCSV. inputCSV is the file name to process and outputCSV is the resulting file with the 0 RPS entries added. Note that to run this script you will need to use a PowerShell window, rather than a command prompt. Below is the command I used:

PS C:\Program Files (x86)\Log Parser 2.2> .\AddNulsPerSec.ps1 -inputcsv RPSDistribution.CSV -outputcsv RPSDistributionNormalised.CSV

This created the file RPSDistributionNormalised.CSV. I charted this file in Excel and we now have a time-normalised distribution. Take a look at the X axis. This looks more logical now as the times are more evenly spaced. It seems from looking at this, that peak times are between 10am-11am, although one could argue that a lot of the day was fairly busy, with a bit of a lull between 2 and 3pm.

[Image: time-normalised RPS distribution chart (Excel)]

So what else can we do?

Right, so apart from the utility of being able to get a sense of when there are a lot of requests versus quiet times, can we find out anything else useful? Much insight can be gleaned from Mike Wise’s document, so here I will cover a couple of things specific to RPS.

RPS distribution for certain users

Let’s go back to the LogWithSeconds.CSV we started with and find out the top users for the period being examined. In the log parser command below we are grouping users by the total requests they made, ordering from largest to smallest:

logparser -i:csv "select top 20 count(*) as ct,cs-username as user from LogWithSeconds.csv group by user order by ct desc"

A snippet of the output from this command is below:

ct  user
--- --------------------
840 DOMAIN\Jame.Smith
688 DOMAIN\searchcrawler
614 DOMAIN\Ian.Jones
508 DOMAIN\Steve.Hill
357 DOMAIN\Ant.Cough
313 DOMAIN\dom.davies
260 DOMAIN\matthew.martin

Hmm, I notice that the search crawler account (DOMAIN\searchcrawler) was busy during that day. It appears to have made the second largest number of requests. How about we work out when the search crawler was active by filtering the requests just for this user. Perhaps search crawls are active during peak times and introducing unnecessary load on the server?

First up, let’s create the RPS distribution, but this time just for the search crawler account (note the SQL WHERE clause in the command below)

logparser -i:CSV -o:CSV "select count(*) as ct,date as Date, secs,max(hh) as hh,max(mi) as mi,max(ss) as ss,cs-username as user from LogWithSeconds.csv where user='DOMAIN\searchcrawler' group by user, date,secs order by date, secs" -q >crawler.csv

Now we need to pad CRAWLER.CSV out with 0 entries to time-normalise it for the seconds in which it wasn’t active…  back to my PowerShell script…

PS C:\Program Files (x86)\Log Parser 2.2> .\AddNulsPerSec.ps1 -inputcsv crawler.csv -outputcsv CrawlerNormalised.csv

I then took the results from CrawlerNormalised.csv and added them to my previous RPS distribution chart in Excel. Straight away you can see the incremental crawl schedule of this SharePoint installation is 5 hourly. (Note the red lines at regular intervals)

[Image: RPS distribution chart with search crawler requests overlaid]

RPS Distribution for certain clients…

Another use for RPS is to see the pattern of the various applications that interact with SharePoint. Aside from the trusty old browser, we also have Office clients, Windows Explorer, SharePoint Workspace, and 3rd party tools like SharePlus. All of these applications identify themselves to SharePoint via the use of something called the user-agent [stored in the LogWithSeconds.CSV file in a column called cs(user-agent)]. The user agent field is actually part of the HTTP standard and not SharePoint specific, but let’s take advantage of it…

logparser -i:CSV "select count(*) as ct,cs(user-agent) from LogWithSeconds.CSV group by cs(user-agent) order by ct desc" -q >BrowserList.csv

Now, I am not going to paste the complete output of running this command because unfortunately, browsers have a lot of variation in their user agent string. Nevertheless, here are some of results from the BrowserList.csv file…

867 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+5.1;+Trident/4.0;+.NET+CLR+1.1.4322;+.NET+CLR+2.0.50727;+.NET+CLR+3.0.04506.30;+.NET+CLR+3.0.04506.648;+.NET+CLR+3.0.4506.2152;+.NET+CLR+3.5.30729;+yie8)
688 Mozilla/4.0+(compatible;+MSIE+4.01;+Windows+NT;+MS+Search+6.0+Robot)
386 harmon.ie+for+Notes
333 Mozilla/5.0+(Windows;+U;+Windows+NT+6.1;+en-US;+rv:1.9.2.24)+Gecko/20111103+Firefox/3.6.24
250 SharePlus+2.9.5+(iPad;+iPhone+OS+4.2.1;+en_AU)
99  MSFrontPage/12.0
95  Mainsoft+SharePoint+Integrator
68  Microsoft+Office/14.0+(Windows+NT+6.1;+OWSSUPP+14.0.6112;+Pro)
34  Microsoft-WebDAV-MiniRedir/6.1.7600
24  Microsoft+Office+Outlook+2010+(14.0.6112)+Windows+NT+6.1
23  Microsoft+Office+Sharepoint+Workspace+2010+(14.0.6112)+Windows+NT+6.1
4   MobileSafari/7534.48.3+CFNetwork/548.0.3+Darwin/11.0.0

Now looking at these strings, it doesn’t take long to get a sense of the different ways SharePoint has been accessed. How about we generate a distribution of RPS for all iPad devices or apps? Here’s the log parser command along with my time normaliser script.

logparser -i:CSV -o:CSV "select count(*) as ct,date as Date, secs,max(hh) as hh,max(mi) as mi,max(ss) as ss,cs(User-Agent) as ua from LogWithSeconds.CSV where ua like '%iPad%' group by ua, date,secs order by date, secs" -q >iPads.csv

.\AddNulsPerSec.ps1 -inputcsv E:\temp\ipads.csv -outputcsv e:\temp\iPadNormalised.csv

… and the result when added into Excel? iPads were definitely the evening tool of choice that day! Note the green spikes around 9-10pm.

[Image: RPS distribution chart with iPad requests overlaid]

Taking it further…

I won’t do much more on RPS now. Hopefully I have given you enough to do much more clever things than I have covered. As I stated in part 2, the nice thing about RPS is that it can be derived from web server logs and these tend to go back quite far in time. Given that the core sequence of events to produce the graphs above is essentially 3 scripts and can be done quickly, it becomes quite easy to sample different points in history. For example: let’s say that we want to compare the 15th of November 2011 with the 10th of March 2012 to see whether there is an increase/decrease in requests and what this looks like. All we have to do is change the date, re-run the scripts and do some charting magic.

  • logparser -i:IISW3C file:GetSeconds.txt?startdate='2012-03-10'+enddate='2012-03-10' -o:csv >LogWithSeconds.csv
  • logparser -i:CSV -o:CSV "select count(*) as ct,date as Date, secs,max(hh) as hh,max(mi) as mi,max(ss) as ss from LogWithSeconds.csv group by date,secs order by date, secs" -q >RPSDistribution.csv
  • AddNulsPerSec.ps1 -inputcsv RPSDistribution.CSV -outputcsv RPSDistributionNormalised.CSV

We can look at the distribution for an entire week if we wanted to…

  • logparser -i:IISW3C file:GetSeconds.txt?startdate='2012-05-07'+enddate='2012-05-11' -o:csv >LogWithSeconds.csv
  • logparser -i:CSV -o:CSV "select count(*) as ct,date as Date, secs,max(hh) as hh,max(mi) as mi,max(ss) as ss from LogWithSeconds.csv group by date,secs order by date, secs" -q >RPSDistribution.csv
  • AddNulsPerSec.ps1 -inputcsv RPSDistribution.CSV -outputcsv RPSDistributionNormalised.CSV

Also, in case you didn’t notice in part 3, my GetSeconds.txt logparser script that performs the initial processing of the IIS logfiles also stores the minute of the day as well as the second of the day. This allows you to perform all of the same things I have specified in this article, except as Requests Per Minute (RPM) instead. This would allow you to work with a larger date range without such big files (provided RPM is appropriate for what you want). Consult the “Analysing Microsoft SharePoint Products and Technologies Usage” whitepaper for more information on logparser queries for RPM scenarios.
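
As a minimal sketch of what a per-minute version might look like, the query below simply swaps the secs column for the minu column in the earlier per-second distribution query (treat it as a starting point to adapt rather than a canonical script – the whitepaper remains the authoritative reference):

logparser -i:CSV -o:CSV "select count(*) as ct,date as Date, minu,max(hh) as hh,max(mi) as mi from LogWithSeconds.csv group by date,minu order by date, minu" -q >RPMDistribution.csv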

Remember that the virtues of your web server logs go much further than RPS. We saw a hint of this in my examples of showing RPS for just one user or one device type. But this really is just scratching the surface of what can be gleaned via logparser. There are many excellent logparser scripts around, and a quick google search should give you plenty of examples.

Also remember that there are many more sophisticated ways to process this sort of data. For example, putting it into Analysis Services and slicing/dicing it via PowerPivot, or using something like RRDTool. To that end, I would also be seriously remiss if I did not make you aware of the SharePoint Flavored Weblog Reader tool. It was created by Nikander & Margriet Bruggeman who run the SharePoint Dragons blog – probably the best SharePoint performance related blog out there. This tool was specifically designed to make it easier to analyse IIS logs for SharePoint specific information. It is a command line tool, but much simpler and slicker than the methods I introduced in this post. Instead of specifying a date range you specify the number of items from the logs to process. For example:

sfwr.exe 250,000 “E:\LOGS\IIS_WWW\W3SVC1045333159”

Here are some of the things it reports on for you straight out of the box:

  • Most busy days of the week, most requested pages and requested pages per day.
  • The average, max and min request times per URI, InfoPath URI and Report Server URI.
  • Browser percentages, dead links, failed pages, percentage of error page requests.
  • Requests per hour per day, requests per hour, requests per user (also per week and month)
  • Slowest requests, top requests per hour, top visitors.
  • Traffic per day and per week in MB.
  • Unique visitors per day, week and month

Reflections…

At the end of the day, while examining the pattern of RPS can be very handy in offering insights into how your web application (SharePoint or otherwise) performs, as a lead indicator it is always going to be fairly wishy washy. As soon as you turn your attention to the future, many variables come into play that you cannot predict as accurately as you’d like. Your existing webserver logs can offer you a lot of ways to help make a more informed prediction, but ultimately one has to take into account the new unique features of SharePoint 2010, how they will be used and so forth.

I will be returning to this theme, once we examine some other performance indicators, but hopefully at this point, you might find some aspects of Microsoft’s Capacity planning for Microsoft SharePoint Server 2010 guide less intimidating. Page 43 in particular has some great material that builds upon what we cover here. To quote Microsoft…

Understanding the distribution of the requests based on the clients applications that are interacting with the farm can help predict the expected trend and load changes after migrating to SharePoint Server 2010. As users transition to more recent client versions such as Office 2010, and start using the new capabilities new load patterns, RPS and total requests are expected to grow

I will leave you with a terrific example of a graph that Microsoft created using IIS logs (on page 44 of the aforementioned document). This is a view of a typical day in an internal Microsoft environment serving what is deemed “a typical social solution”. It shows just how much additional load a new feature can introduce (in this case, the Outlook Social Connector feature accounts for 6.2% of the total number of requests). The combination of different clients on the X axis, with the % distribution of overall and per-user requests on the Y axis, is really handy.

[Image: Microsoft chart of request distribution by client application]

Coming up next…

At this point I think we have covered RPS sufficiently, and dovetailed in nicely to Microsoft documentation – particularly pages 41-47 of the SharePoint 2010 Capacity Planning Guide. Our next stop will be looking at another much misunderstood lead indicator for performance – disk IO and latency. Once again I will introduce you to a couple of useful tools and offer you what I think is the best way to approach disk performance requirements that will make life easier for you and your storage people.

Until then, thanks for reading…

Paul Culmsee




Demystifying SharePoint Performance Management Part 3 – Getting at RPS

Hi and welcome back to this series aimed at making SharePoint performance management a little more digestible. In the first post we examined the difference between lead and lag indicators and in the second post, we specifically looked at the lead indicator of Requests Per Second (RPS) and its various opportunities and issues. In this episode we are actually going to do some real work at the – wait for it – command line! As a result the collective heart rates of my business oriented readers – who are avid users of the “I’m business, not technical” cliché – will start to rise, since anything that involves a command line is shrouded in mystique, fear, uncertainty and doubt.

For the tech types reading this article, please excuse the verboseness of what I go through here. I need to keep the business types from freaking out.

Okay… so in the last post I said that despite its issues in terms of being a reliable indicator of future performance needs, RPS has the advantage that it can be derived from your existing deployment. This is because the information needed is captured in web server (IIS) logs over time. Having this past performance means you have a lag indicator view of RPS, which can be used as a basis to understand what the future might look like with more confidence than some arbitrary “must handle x RPS.”

Now just because RPS is held inside web server log files, does not mean it is easy to get to. In this post, I will outline the 3 steps needed to manipulate logfiles to extract that precious RPS goodness. The utility that we are going to use to do this is Log parser.

Now a warning here: This post assumes your existing deployment runs on Microsoft’s IIS platform v7 (the webserver platform that underpins SharePoint 2010). If you are running one of the myriad of portal/intranet platforms, you are going to have to take this post as a guide and adjust to your circumstances.

Step 1: Getting Log parser

Installing Log parser is easy. Just install version 2.2 as you would any other tool. It will run on pretty much any Windows operating system. Once installed, it will likely reside in the C:\Program Files (x86)\Log Parser 2.2 folder. (Or C:\Program Files\Log Parser 2.2 if you have an older, 32 bit PC).

There you go business types – that wasn’t so hard was it?

Step 2: Getting your web server logs

After the relative ease of getting log parser installed, we now need the logs themselves to play with. We are certainly not going to mess with a production system so we will need to copy the log files for your current portal to the PC where you installed Log parser. If you do not have access to these log files, call your friendly neighbourhood systems administrator to get them for you. If you have access (or do not have a friendly neighbourhood systems administrator), then you will need to locate the files you need. Here’s how:

Assuming you have access to your web front end server/s, you can load Internet Information Services (IIS) Manager from Start->All programs->Administrative tools on the server. Using this tool we can find out the location of the IIS log files as well as the specific logs we need. By default IIS logs are stored in C:\inetpub\logs\LogFiles, but it is common for this location to be changed to somewhere else. To confirm this in IIS manager, click on the server name in the left pane, then click on the “Logging” icon in the right pane. In the example below, we can see that the IIS logfiles live in the G:\LOGS\IIS folder (I always move the logfiles off C:\ as a matter of principle). While you are there, pay special attention to the fairly nondescript “Use local time for file naming and rollover” tickbox. We are going to return to that later…

[Images: IIS Manager – server Logging settings]

Okay so we know where the log files live, so let’s work out the sub-folder for the specific site. Back in the left hand pane now, expand “Sites” and find the web site you want to profile for RPS. When you have found it, select it and find the “Advanced Settings” link and click it.

[Image: IIS Manager – Advanced Settings]

On the next screen you will see the ID of the site. It will be a large number – something like 1045333159. Take a note of this ID, because all IIS logs for this site will be stored in a folder with the name “W3SVC” prepended to this ID (eg W3SVC1045333159). Thus the folder we are looking for is G:\LOGS\IIS\W3SVC1045333159. Copy the contents of this folder to the computer where you have installed logparser. (In my example below I copied the logs to E:\LOGS\IIS_WWW\W3SVC1045333159 on a test server).
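
Incidentally, if you would rather avoid clicking around IIS Manager, the site IDs and the configured log folder can also be listed with a little PowerShell. This is just a sketch and assumes you run it on the web front end with the WebAdministration module available (it ships with IIS 7.5 and later; on older boxes you may need the IIS PowerShell snap-in instead):

# List each IIS site with its numeric ID and the folder its logs are written to
Import-Module WebAdministration
Get-Website | Select-Object Name, Id, @{Name="LogFolder"; Expression={ $_.logFile.directory }}

The folder you are after is the LogFolder value with “W3SVC” and the site ID appended (eg W3SVC1045333159).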

[Images: site ID in Advanced Settings, and the copied log folder]

Step 3: Preparation of log files…

Okay, so now we have our log files copied to our PC, so we can start doing some log parser magic. Unfortunately the default IIS logfile format does not make RPS reporting particularly easy and we have to process the raw logs to make a file that is easier to use. Now business people – stay with me here… the payoff is worth the command line pain you are about to endure!

First up, we will make use of the excellent work of Mike Wise (You can find his original document here), who created a script for log parser that processes all of the logfiles and creates a single (potentially very large) file that:

  • includes a new field which is the time of the day converted into seconds
  • splits the date and timestamp up into individual bits (day, month, hour, minute, etc.) This makes it easier to do consolidated reports
  • excludes 401 authentication requests (way back in part 1 I noted that Microsoft excludes authentication traffic from RPS)

I have pasted a modified version of Mike’s log parser script below, but before you go and copy it into Notepad, make sure you check two really important things.

  1. Be sure to change the path in the second last line of the script to the folder where you copied the IIS logs to (In my case it was E:\LOGS\IIS_WWW\W3SVC1045333159\*.log)
  2. Check whether IIS is saving your logfiles using UTC timestamps or local timestamps. (Now you know why I told you to specifically make note of the “Use local time for file naming and rollover” tickbox earlier). If the box is unticked, the logs are in UTC time and you should use the first script pasted below. If it is ticked, the logs are in local time the second script should be used.

UTC Script

select EXTRACT_FILENAME(LogFilename),LogRow, date, time, cs-method, cs-uri-stem, cs-username,
c-ip, cs(User-Agent), cs-host, sc-status, sc-substatus, sc-bytes, cs-bytes, time-taken,

add(
    add(
         mul(3600,to_int(to_string(to_localtime(to_timestamp(date,time)),'hh'))),
         mul(60,to_int(to_string(to_localtime(to_timestamp(date,time)),'mm')))
    ),
    to_int(to_string(to_localtime(to_timestamp(date,time)),'ss'))
) as secs,

add(
    mul(60,to_int(to_string(to_localtime(to_timestamp(date,time)),'hh'))),
    to_int(to_string(to_localtime(to_timestamp(date,time)),'mm'))
) as minu,

to_int(to_string(to_localtime(to_timestamp(date,time)),'yy')) as yy,
to_int(to_string(to_localtime(to_timestamp(date,time)),'MM')) as mo,
to_int(to_string(to_localtime(to_timestamp(date,time)),'dd')) as dd,
to_int(to_string(to_localtime(to_timestamp(date,time)),'hh')) as hh,
to_int(to_string(to_localtime(to_timestamp(date,time)),'mm')) as mi,
to_int(to_string(to_localtime(to_timestamp(date,time)),'ss')) as ss,
to_lowercase(EXTRACT_PATH(cs-uri-stem)) as fpath,
to_lowercase(EXTRACT_FILENAME(cs-uri-stem)) as fname,
to_lowercase(EXTRACT_EXTENSION(cs-uri-stem)) as fext

from e:\logs\iis_www\W3SVC1045333159\*.log

where sc-status<>401 and date BETWEEN TO_TIMESTAMP(%startdate%, 'yyyy-MM-dd') and TO_TIMESTAMP(%enddate%, 'yyyy-MM-dd')

Local Time Script

select EXTRACT_FILENAME(LogFilename),LogRow, date, time, cs-method, cs-uri-stem, cs-username,
c-ip, cs(User-Agent), cs-host, sc-status, sc-substatus, sc-bytes, cs-bytes, time-taken,

add(
   add(
      mul(3600,to_int(to_string(to_timestamp(date,time),'hh'))),
      mul(60,to_int(to_string(to_timestamp(date,time),'mm')))
   ),
   to_int(to_string(to_timestamp(date,time),'ss'))
) as secs,

add(
   mul(60,to_int(to_string(to_timestamp(date,time),'hh'))),
   to_int(to_string(to_timestamp(date,time),'mm'))
) as minu,

to_int(to_string(to_timestamp(date,time),'yy')) as yy,
to_int(to_string(to_timestamp(date,time),'MM')) as mo,
to_int(to_string(to_timestamp(date,time),'dd')) as dd,
to_int(to_string(to_timestamp(date,time),'hh')) as hh,
to_int(to_string(to_timestamp(date,time),'mm')) as mi,
to_int(to_string(to_timestamp(date,time),'ss')) as ss,
to_lowercase(EXTRACT_PATH(cs-uri-stem)) as fpath,
to_lowercase(EXTRACT_FILENAME(cs-uri-stem)) as fname,
to_lowercase(EXTRACT_EXTENSION(cs-uri-stem)) as fext

from e:\logs\iis_www\W3SVC1045333159\*.log

where sc-status<>401 and date BETWEEN TO_TIMESTAMP(%startdate%, 'yyyy-MM-dd') and TO_TIMESTAMP(%enddate%, 'yyyy-MM-dd')

After choosing the appropriate script and modifying the second last line, save this file into the Log parser installation folder and call it GETSECONDS.TXT.

For the three readers who *really* want to know, the key part of what this does is to take the timestamp of each log entry and turn it into what second of the day and what minute of the day it is. So assuming the timestamp is 8:35am at the 34 second mark, the formula effectively adds together:

  • 8 * 3600 (since there are 3600 seconds in an hour)
  • 35 * 60 (60 seconds in a minute)
  • 34 seconds

= 30934 seconds

  • 8 * 60 (60 minutes in an hour)
  • 35 minutes

= 515 minutes

Now that we have our GETSECONDS.TXT script ready, let’s use Log parser to generate our file that we will use for reporting. Open a command prompt (for later versions of windows make sure it is an administrator command prompt) and change directory to the LogParser installation location.

C:\Program Files (x86)\Log Parser 2.2>

Now decide a date to report on. In my example, the logs go back two years and I only want the 15th of November 2011. The format for the dates MUST be “yyyy-MM-dd” (e.g. 2011-11-15).

Type in the following command (substituting whatever date range interests you):

logparser -i:IISW3C file:GetSeconds.txt?startdate='2011-11-15'+enddate='2011-11-15' -o:csv -q >e:\temp\LogWithSeconds.csv

  • The –i parameter specifies the type of input file. In this case the input file is IISW3C (IIS weblog format)
  • The ?startdate parameter specifies the start  date you want to process
  • The +enddate parameter specifies the end date you want to process
  • The –o parameter specifies the type of output file. In this case the output file is CSV format
  • The –q parameter says not to prompt the user for anything
  • The >LogWithSeconds.csv says to save the CSV output into a file called LogWithSeconds.csv

So depending on how many logfiles you had in your logs folder, things may take a while to process. Be patient here… after all, it might be processing years of logfiles (and now you know why we didn’t do this in a production install!). Also be warned, the resulting LogWithSeconds.csv that is created will be very, very big if you specified a wide date range. Whatever you do, do not open this file with notepad if it’s large! We will be using additional log parser scripts to interrogate it instead.

Conclusion

Right! If you got this far and you’re normally not a command line kind of person… well done! If you are a developer, thanks for sticking with me. You should have a newly minted file called LogWithSeconds.csv and you are ready to do some interrogation of it. In the next post, I will outline some more logparser scripts that generate some useful information!

Until then, thanks for reading

Paul Culmsee

P.S. Why not check out my completely non-SharePoint book entitled “The Heretics Guide to Best Practices”. It recently won a business book award.



Demystifying SharePoint Performance Management Part 2 – So what is RPS anyway?

Hi all

I never mentioned it in the first post that the reason I am blogging again is I finally completed most of the game Skyrim. Man – that game is dangerous if you value your time!

Anyway, in the first post, I introduced this series by covering the difference between lead and lag performance indicators. To recap from part 1, a lead indicator is something that can help predict an outcome by measuring an action, whereas a lag indicator measures the result or outcome achieved from taking an action. This distinction is important to understand, because otherwise it is easy to use performance measurements inappropriately or get very confused. Lead indicators in particular sometimes feel wishy washy because it is hard to have a direct correlation to what you are seeing.

In this post, we are going to examine one of the most commonly cited (and abused) lead indicators used to measure performance: good old Requests Per Second (RPS). Let’s attempt to make this more clear…

Microsoft defines RPS as:

The number of requests received by a farm or server in one second. This is a common measurement of server and farm load. The number of requests processed by a farm is greater than the number of page loads and end-user interactions. This is because each page contains several components, each of which creates one or more requests when the page is loaded. Some requests are lighter than other requests with regard to transaction costs. In our lab tests and case study documents, we remove 401 requests and responses (authentication handshakes) from the requests that were used to calculate RPS because they have insignificant impact on farm resources

So according to this definition, RPS is any interaction between browsers (or any other device or service making web requests) and the SharePoint webserver, excluding authentication traffic. The logic of measuring requests per second is that it provides insight into how much load your SharePoint box can take because, after all, SharePoint at the end of the day is servicing requests from users.

RPS by example

Before we start picking apart RPS and its issues, let’s look at an example. Assuming you are viewing this page in Internet Explorer version 8 or 9, press F12 right now. You should see something like the screen below. If you have not seen it before, it is called the internet explorer developer tools and is bloody handy. Now click on the “Network” link, highlighted below and then click the “Start capturing” button.

[Image: IE developer tools – Network tab]

Now refresh this page and watch the result. You should see a bunch of activity logged, looking something like the picture below.

[Image: network requests captured when reloading the page]

What you are looking at is all of the requests that your browser has made to load this very page. While the detail is not overly important for the purpose of this post, the key point is that to load this page, many requests were made. In fact if you look in the left-bottom corner of the above screenshot, a total of 130 individual requests are listed.

So, first pop-quiz for the day: Were all 130 requests made to my cleverworkarounds blog to refresh this page? The answer my friends is no. In actual fact, only 2 items were loaded from my blog!

So why the discrepancy? What happened to the other 128 requests? Two main reasons.

1. Browser cache: First up, many of the items listed above were cached by my browser already. I’ve been to this site before, and so a lot of the page components (CSS style sheets, logos and the like) did not have to be retrieved again. It just happens that the internet explorer developer tool shows requests that were handled by locally cached data as well as actual requests made to SharePoint. If you look closely at the “Result” column in the above screenshot, you will see that some entries are grey colour while others are black. All of the grey entries are cached requests. They never left the confines of the browser. This alone accounts for 95 of the 130 requests.

Now this is worth consideration because if a browser has never accessed this site before, there will be no content in the browser cache. Therefore, on first access, the browser would indeed have made 95 additional requests to load the page. This scenario is most likely on day one of a production SharePoint rollout, where a large chunk of the workforce might load the homepage for the first time.

2. Content from other sites: The second reason for the discrepancy is that some content doesn’t even come from the cleverworkarounds site. Anytime you visit a blog and it has a snazzy widget like Amazon books or Facebook “like” buttons, that content is very likely being retrieved from Amazon or Facebook. In the case of this very article you are reading, 33 requests were made to other sites like Facebook, amazon, feedburner, sharepointads and whoever else happens to grace a widget on the right hand side. In these cases, my server is not handling this traffic at all. This accounts for 33 of the 130 requests.

95 + 33 = 128 of the 130 requests made.

So hopefully now you get what is meant by RPS. Let’s now look at its utility in measuring performance.

Dangers of RPS reliance…

Consider two fairly typical SharePoint transactions: the first example is loading the SharePoint home page and the second example is where a user loads a document from a SharePoint document library. Below I have compared the two transactions by using an Office 365 site of mine and capturing the requests made by each one. (For what it’s worth, I used a utility called Fiddler rather than the developer toolbar because it has some snazzier features.)

In example #1, we have loaded the homepage of an Office365 site (assuming for the first time). In all, 36 requests were made to the server. If we add up the amount of data returned by the server (summing the “Body” column below), we have a total of 245,322 bytes received.

[Image: Fiddler trace of the Office 365 home page load]

In example #2, we are looking at the trace of me opening a 7 megabyte document from a document library. Notice that this time, 17 requests were made. But compared to the first example, significantly more data was returned from the server: 7,245,876 bytes in fact. If you drill down further by examining the “Body” column, you will notice that of those 17 requests, 3 of them accounted for the bulk of the data transferred, with 3,149,348, 3,148,008 and 891,069 bytes respectively.

[Image: Fiddler trace of the 7MB document download]

So here is my point. Some requests are more significant than others! In the latter example, 3 of the 17 requests transferred 98% of the data. The second transaction also took much longer than the first, and the data was retrieved from the SQL Server database, which means that this interaction with SharePoint likely placed more load on the back-end than the first example when the home page was loaded. When loading the home page, the data may have been served from one of the many SharePoint caches, barely touching the back-end SQL box.

Now with that in mind, consider this: the typical rationale you see around the interweb for utilising RPS as a performance tool is to estimate future scalability requirements. Statements like “This SharePoint farm needs to be capable of 125 RPS” are fairly common. Traditionally, the figure was derived from a methodology that looked something like the steps below (a quick worked example follows the list):

  1. Work out the peak times of the day for SharePoint site usage (for example between 10:45am-2:45pm each day)
  2. Estimate the number of concurrent users accessing your SharePoint site during this time
  3. Classify the users via their usage profile (wussy, light, heavy, psycho, etc)
  4. Estimate how many transactions each of these user types might make in the peak hour (a transaction being an operation like browse home page, edit document, and so on)
  5. Multiply concurrent users by the number of expected transactions to derive the total number of transactions for the period
  6. Divide the total by the number of seconds in the period to work out how many transactions per second.
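
To see how the arithmetic plays out, here is a hypothetical worked example (every figure below is invented purely to illustrate the method):

  • 1,000 concurrent users during a 4 hour peak window (say 10:45am-2:45pm)
  • Each user classified as “light” at 20 transactions per hour, giving 1,000 × 20 × 4 = 80,000 transactions for the period
  • 4 hours = 14,400 seconds, so 80,000 / 14,400 ≈ 5.6 transactions per second

On paper that looks precise, but as we are about to see, the number hides some shaky assumptions.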

There are lots of issues with this methodology, but here are 4 obvious ones.

  1. The first is that it confuses transactions with requests. While browsing the SharePoint home page might be considered one “transaction”, it will likely consist of more than one request (particularly if the content being served is designed to be fairly dynamic and not rely on cache data). Essentially this methodology may underestimate the number of requests because it assumes a 1:1 relationship between a transaction and a request. My two examples above demonstrate that this is not the case.
  2. The classification of usage profile of users (light, medium, heavy) is crude and overlooks the aforementioned variation in usage patterns. A “heavy user” might continually update a SharePoint calendar, while a “light” user might load 20 megabyte documents or run sophisticated reports. In both cases, the real load on the infrastructure – and the resulting response time – may be quite varied.
  3. It fails to take into consideration the fact that SharePoint 2010 in particular has many new features in the form of Service Applications. These also make requests behind the scenes that have load implications – the most obvious example being the search service crawling SharePoint sites.
  4. It also overlooks the fact that SharePoint content is often accessed indirectly, via non-browser client tools such as SharePoint Workspace, OneNote, Outlook Social Connector, Harmon.ie and the like, which make their own requests behind the scenes. If Colligo Contributor is deployed to all desktops, does that make all users “heavy”?

So hopefully by now, you can understand the folly of saying to someone “This system should be capable of handling 150 RPS.” There are simply far too many variables that contribute to this figure, and each request can be wildly different in terms of the real load it places on the back-end servers. Now you know why Robert Bogue likened this issue to the Drake Equation in part 1. An RPS target arrived at utilising this sort of methodology is likely to be fairly inaccurate and of questionable value.

So what is RPS good for and how do I get it?

So am I anti RPS? Definitely not!

The one thing RPS has going for it, which makes it incredibly useful, is that it is likely to be the one performance metric that any organisation can tap into straight away (assuming you have an existing deployment). This is because the metric is collected in web server (IIS) logs over time: each request made to the server is logged with a date and timestamp. For most places, this is the only high-fidelity performance data you have access to, because many organisations do not collect and store other stats like CPU and disk IO performance over time. While it’s unlikely you would be able to see CPU utilisation for a server at 9:53am on a Tuesday six months ago, chances are you can work out the RPS at that time if you have an existing intranet or portal. The reason for this is that IIS logs are not cleared automatically, so you have the opportunity to go back in time and see how a SharePoint site has been utilised.
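
As a taster of what those logs make possible, here is a rough PowerShell sketch that counts requests per minute in a single log file and reports the busiest minutes. The log path is a made-up example, and it assumes the default W3C format where the date and time are the first two fields of each entry (also note that IIS logs times in UTC by default):

```powershell
# Find the busiest minutes in one IIS log file (the path is a placeholder - use your own)
Get-Content 'C:\inetpub\logs\LogFiles\W3SVC1\u_ex120601.log' |
    Where-Object { $_ -notlike '#*' } |                      # skip the W3C header lines
    ForEach-Object { ($_ -split ' ')[0..1] -join ' ' } |     # date + time, e.g. "2012-06-01 09:53:12"
    ForEach-Object { $_.Substring(0, 16) } |                 # truncate to the minute
    Group-Object |
    Sort-Object Count -Descending |
    Select-Object -First 5 Count, Name                       # the five busiest minutes
```

Divide any of those per-minute counts by 60 and you have a rough RPS figure for that minute. LogParser, covered shortly, does this sort of thing with far less fuss.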

The benefit is that we have the means to understand past performance patterns of an organisation’s use of their intranet or portal. We can work out stuff like:

  • peak times of the day for usage of the portal based on previous history
  • the maximum number of requests that the server has ever had to process
  • the rate of increase/decrease of RPS over time (i.e. “What was peak RPS 6 months ago? What was it 3 months ago?”)
  • the patterns/distribution of requests over a typical day (peaks and troughs – we can see the “shape” of SharePoint usage over a given period)

As an added bonus, the data in web server logs allows for some other fringe benefits, including stuff like:

  • the percentage or pattern of requests that were “non-interactive” (such as the % of requests that are search crawls or SharePoint Workspace syncs)
  • identifying usage patterns of certain users (e.g. the top 10 users and their usage patterns)

Finally, if you monitor CPU and disk performance, you can compare the RPS peaks against those other performance counters and then interpolate how things might have been in the past (although this has some caveats too).

Coming up next…

Okay, so now you are convinced that RPS does not suck – and you want to get your hands on all this RPS goodness. The good news is that it’s fairly easy to do, and Microsoft’s Mike Wise has documented the definitive way to do it. The bad news is that you have to download and learn yet another utility. Fear not though, as the utility (called LogParser) is brilliant and needs to be in your arsenal anyway (especially the business oriented SharePoint readers of this blog – this is not one just for the techies). Put simply, LogParser provides the ability to run SQL-like queries against your log files. You can have it open a log file (or series of files), process them via a SQL-style language, and then output the results of your query in different formats for reporting.
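
To give just a flavour of that (the detail is saved for the next post), here is a minimal sketch that counts requests for each second across a set of IIS logs. The install path and log location are assumptions based on default setups, so adjust them to suit:

```powershell
# A taster of LogParser's SQL-like language: requests per second from IIS logs,
# written out to a CSV file for charting. Paths assume default install locations.
$query = @"
SELECT QUANTIZE(TO_TIMESTAMP(date, time), 1) AS Second,
       COUNT(*) AS Requests
INTO rps.csv
FROM C:\inetpub\logs\LogFiles\W3SVC1\u_ex*.log
GROUP BY Second
ORDER BY Second
"@
& 'C:\Program Files (x86)\Log Parser 2.2\LogParser.exe' $query -i:IISW3C -o:CSV
```

Open the resulting rps.csv in Excel and you can chart the “shape” of usage described above.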

But, just as I have whetted your appetite, I am going to stop. This post is already getting large and I still have a bit to get through in relation to using LogParser, so I will focus on that in the next post.

Hopefully at this point you don’t totally hate RPS, and you have a much better idea of what it is and some of the issues with its use.

Thanks for reading

Paul Culmsee

www.hereticsguidebooks.com



