Back to Cleverworkarounds mainpage
 

Jul 23 2012

Demystifying SharePoint Performance Management Part 8 – More on SQL and SQLIO

Well here we are at part 8 of my series on making SharePoint performance management that little bit easier to understand. What is interesting about this series is its timing. If by some minute chance that the marketing tsunami has passed you by at the time I write this, SharePoint 2013 public beta was released. Much is being made about its stated requirement of 24GB of RAM for a “Single server with a built-in database or single server that uses SQL Server”. While the reality is that requirements depends on what components that you are working with, this series of articles should be just as useful in relation to SharePoint 2013 as for any other version.

Now, if you have been following events thus far, we have been spending some time examining disk performance, as that is a very common area where a sub optimal configuration can result in a poor experience for users. In part 6, we looked at the relationship between the performance metrics of disk latency, IOPS and MBPS. We also touched on the IO characteristics (nerd speak for the manner in which something reads and writes to disk) of SQL Server and some SharePoint components. In the last post, we examined the windows performance counters that one would use to quickly monitor latency and IOPS in particular. We then finished off by taking a toe dip into the coolness of the SQLIO utility, that is a great tool for stress testing your storage infrastructure by pulverising it with different IO read and write patterns.

In this post, we will spend a bit of time taking SQLIO to the next step and I will show you how you can run a comprehensive disk infrastructure stress test. Luckily for the both of us, others have done the hard work for us and we can reap the benefits of their expertise and insights. First up however, I would like to kick things off by spending a little time showing you the relationship between SQLIO results and performance monitor counters. This helps to reinforce what the reported numbers mean.

Performance Monitor and SQLIO

In the previous post when we used Windows Performance Monitor, we plotted IOPS and Latency by watching the counters as they occurred in real-time. While this is nice for a quick analysis, nothing is actually stored for later analysis. Fortunately, performance monitor has the capability to run a trace and collect a much larger data set for a more detailed analysis later. So first up, lets use performance monitor to collect disk performance data while we run a SQLIO stress test. After the test has been run, we will then review the trace data and validate it against the results that SQLIO reports.

So go ahead and start up performance monitor (and consult part 7 of this series if you are unsure of how to do this). Looking at the top left of the performance manager, you should see several options listed under “Performance”. Click on “Data Collector Sets” and look for a sub menu called “User Defined”. Now right click on “User Defined” and choose “New –> Data Collector Set” as shown below:

image

This will start a wizard that will ask us to define what performance counters we are interested in and how often to sample performance. I have pasted screenshots of the sequence below (click to enlarge any particular one). First up we need to give a name to this collection of counters and as you can see below, I called mine “Disk IO Experiments”. Once we have given it a name, we have to choose the type of performance data we want to collect. Tick the “Performance counter” button and ensure the others are left unticked.

image  image

Next we need to pick what specific counters we need. We will use the same counters that we used in part 7, with the addition of two additional ones. To remind you of part 7, the counters we looked at were:

  • Avg. Disk sec/Read   – (measures latency by looking at how long in seconds, a read of data from the disk took)
  • Avg. Disk sec/Write  -  (measures latency by looking at how long in seconds, a write of data to the disk took)
  • Disk Reads/sec  -  (measures IOPS by looking at the rate of read operations on the disk per second)
  • Disk Writes/sec  – (measures IOPS by looking at the rate of write operations on the disk per second)

In addition to these counters, we will also add two more to the collector set

  • Avg. Disk Bytes/Read – (Measures size of each read request by reporting the number of bytes each used)
  • Avg. Disk Bytes/Write – (Measures size of each write request by reporting the number of bytes each)

We will use these counters to see if the size of the IO request than SQLIO uses is reported correctly.

Depending on your configuration, choose the PhysicalDisk or LogicalDisk  performance object (consult part 7 for the difference between PhysicalDisk and LogicalDisk). You will then find the performance counters I listed above. Before you do anything else, make sure that you pick the right disk or partition from the “Instances of selected object” section. We need to specifically pick the disk or partition that SQLIO is going to stress test. Now you select each of the aforementioned six performance counters and click the “Add” button. Finally, make sure that you pick the sample interval to be 1 second as shown below. This is really important because it makes it easy to compare to SQLIO which reports on a per second basis.

image  image

At this point you do not need to configure anything else, so click the “Finish” button, rather than the “next” button, and the collector should now be ready to go. It will not start by default, but since there is no fun in that, let’s collect some data. Right click on your shiny new data collector set and choose “Start”.

image  image

Once started, performance monitor is collecting the values of the six counters each and every second. Now let’s run a SQLIO command to give it something to measure. In this example, I am going to run SQLIO with random 8KB writes. But to make it interesting, I will use two threads and simulate 8 outstanding IO requests in the queue. If you recall by grocery check-out metaphor of part 6, this is like having 8 people with full shopping carts waiting in line for a single check-out operator. Since the guy at the back of the line has to wait for the seven people in front of him to be processed, he has to wait longer. So with eight outstanding IO requests, latency should increase as each IO request will be sitting in a queue behind the seven other requests.

By the way, if none of that made sense, then you did not read part 6 and part 7. I urge you to read them before continuing here, because I am assuming prior knowledge of SQLIO and disk latency characteristics and the big trolley theory..

Here is the SQLIO command and below is the result…

SQLIO –kW –b8 –frandom –s120 –t2 –o8 –BH –LS F:\testfile.dat

image

Now take a note of the results reported. IOPS was 526, MBs/sec was 4.11 and as expected, the average latency was much larger than the SQLIO tests we ran in part 7. In this case, latency was 29 milliseconds.

Let’s now compare this to what performance monitor captured. First up, return to Performance Monitor, and stop your data collector set by right clicking on it and choosing “Stop”. Now if you cast your eye to the top left navigation pane, you should see an option called “Reports” listed under “Performance”. Click on “Reports” and look for a sub menu called “User Defined”. Expand “User Defined” and hey presto! Your data collector set should be listed…

image  image

Expand the data collector set and you will find a report for the data you just collected. The naming convention is the server name and the date of the collection. Click on this and you will then see the performance data for that collection in the right pane. At the bottom you can see the six performance counters we chose and just by looking at the graph, you can clearly see when SQLIO started and stopped.

image

Now we have to do one additional step to make sure that we are comparing apples with apples. Performance monitor will calculate its averages based on the total time displayed. As you can see above, I did not run SQLIO straight away, but the performance counters were collected each second nonetheless. Therefore we have a heap of zero values that will bring the averages down and mislead. Fear not though, it is fairly easy (although not completely obvious) to zoom into the time we are interested in. If you look closely, just below the performance graph, where the time is reported, there is a sliding scale. If you click and drag the left and right boxes, you can highlight a specific time you are interested in. This will be shown in the performance graph too, so using this tool, we can get more specific about the time we are interested in. Then in the toolbar above the graph, you will see a zoom button. Click it and watch the magic…

image  image

As you can see below, now we are looking at the performance data for the period when the SQLIO was run. (Now it should be noted that windows performance monitor isn’t particularly granular here. I had to fiddle with the sliding scale a couple of times to accurately set the exact times when SQLIO was started and then stopped.)

image

Now let’s look at the results reported by performance monitor. The screenshot above is looking at the number of Disk Writes per second. Let’s zoom into the figures for the time period and example the average result over the sample period. To save you squinting, I have pasted it below and called out the counter in question. Performance monitor has reported average “Disk Writes/Sec” as 525.423. This is entirely consistent with SQLIO’s reported IOPS of 526.

image

Latency (reported in seconds via the counter Avg. Disk sec/Write) is also fairly consistent with SQLIO. The figure from performance monitor was 0.03 seconds (30 milliseconds). SQLIO reported 29 milliseconds.

image

What about IO size? Well, that’s what Avg disk bytes/write is for… Let’s take a look shall we? Yup.. 8192 kilobytes, which is exactly the parameters specified.

image

SQL IO characteristics revisited (and an awesome script)

Now that we understand what SQLIO is telling us via examining windows performance monitor counters, I’d like to return to the topic of SQL IO patterns. Back at the end of part 6, I spent some time talking about SQL and SharePoint IO characteristics. As a quick recap, I mentioned SQL reads and writes to databases via 8KB pages. Now based on me telling you that, you might assume that if you had to open a large document from SharePoint (say 1MB  or 1024KB), SQL would make 128 IO requests of 8KB each.

While that would be a reasonable assumption, its also wrong. You see, I also mentioned that SQL Server also has a read-ahead algorithm. This algorithm means that means SQL will try and proactively retrieve data pages that are going to be used in the immediate future. As a result, even though a single page is only 8KB, it is not unusual to see SQL read data from disk in a much wider range if it thinks the next few 8KB pages are likely to be asked for anyway. Now as an aside, if you are running SQL Enterprise edition, the possible read-ahead range is from 1 to 128 pages (other editions of SQL max out at 32 pages). Assuming SQL Enterprise edition, this translates to between 8KB and 1024KB for a single IO operation. Think about this for a second… based on the 1MB document example I used in the previous paragraph, it is technically possible that this could be serviced with a single IO request by an enterprise edition of SQL server.

Okay, so essentially SQL has varying IO characteristics when it comes to reading from and writing to databases. But there is still more to it. This is because there are a myriad of SQL IO operations that we did not even consider in part 6. As an example, we have not spoken about the IO characteristics of how SQL writes to transaction logs (which is sequential as opposed to random IO, and does not use 8k pages at all). Another little known fact with transaction logs is that SQL has to wait for them to be “flushed to stable media” before the data page itself can be flushed to stable media. This is known as Write Ahead Logging and is used for data integrity purposes. What is means though is that if logging has a lot of latency, the rest of SQL server can potentially suffer as well (and if it was not obvious before, yet another good reason why people recommend putting SQL data and log files on different disks).

Now I am not going to delve deep into SQL IO patterns any more than this, because we are now getting into serious nerdy territory. However what I will say is this: by understanding the characteristics of these IO patterns, we have the opportunity to change the parameters we pass to SQLIO and more accurately reflect real-world SQL characteristics in our testing. Luckily for all of us, others have already done the hard work in this area. First up, Bob Duffy created a table that summarises SQL Server IO patterns based on the type of operations being performed. Even better than that… Niels Grove-Rasmussen wrote a completely brilliant post, where not only did he list the IO patterns that SQL is likely to exhibit, he wrote a PowerShell script that then runs 5 minute SQLIO simulations for each and every one of them!

I have not pasted the script here, but you will find it at Niels article. What I will say though is that aside from the obvious 8KB random reads and writes that we have concentrated on thus far, Niels listed several other common SQL IO patterns that his SQLIO script tests:

  • 1 KB sequential writes to the log file (small log writes)
  • 64 KB sequential writes to the log file (bulk log writes)
  • 8 KB random reads to the log file (rollbacks)
  • 64 KB sequential writes to the data files (checkpoints, reindex, bulk inserts)
  • 64 KB sequential reads to the data files (read-ahead, reindex, checkdb)
  • 128 KB sequential reads to the data files (read-ahead, reindex, checkdb)
  • 128 KB sequential writes to the data files (bulk inserts, reindex)
  • 256 KB sequential reads to the data files (read-ahead, reindex)
  • 1024 KB sequential reads to the data files (enterprise edition read-ahead)
  • 1 MB sequential reads to the data files (backups)

The script actually handles more combinations than those listed above because it also tests for differing number of threads (-t ) and outstanding requests (-o ). All in all, over 570 combinations of IO patterns are tested. Be warned here… given that each test takes 5 minutes to run by default, with a 60 second wait time in between each test, be prepared to give this script at least 2 days to let it run its course!

The script itself is dead simple to run. Just open a powershell window, and save Niels script to the SQLIO installation folder. From there, change to that directory and issue the command:

./SQLIO_Batch.ps1

Then come back in 3 days! Seriously though, depending on your requirements, you can modify the parameters of the script to reduce the number of scenarios based on editing the first 7 lines of code which is quite self explanatory

$Drive = @('G', 'H', 'I', 'J')
$IO_Kind = @('W', 'R') # Write before read so that there is something to read.
$Threads = @(2, 4, 8)
#$Threads = @(2, 4, 8, 16, 32, 64)
$Seconds = 10*60 # Five minutes
$Factor = @('random', 'sequential')
$Outstanding = @(1, 2, 4, 8, 16, 32, 64, 128)
$BlockSize = @(1, 8, 64, 128, 256, 1024)

Now if this wasn’t cool enough, Niels also written a second script that parses the output from all of the SQLIO tests. This can produce a CSV file that allows you to perform further analysis in excel. To run this script, we need to know the same of the output file of the first script. By default the filename is SQLIO_Result.<date>.txt. For example:

./SQLIO-Parse.ps1 -ResultFileName ‘SQLIO_Result.2010-12-24.txt’

By default the parse script outputs to the screen, but modifying it to write to CSV file is really easy. All one has to do is comment out the second last line of code and uncomment the last one as shown below:

#$Sqlio | Format-Table -Property Kind,Threads,Seconds,Drive,Stripe,Outstanding,Size,IOs,MBs,Latency_min,Latency_avg,Latency_max -AutoSize

$Sqlio | Export-Csv SQLIO_Parse.csv

Below is an example of the report in Excel. Neat eh?

image

Conclusion and coming up next…

By now, you should be a SQLIO guru and have a much better idea of the sort of IO patterns that SQL Server has beyond just reading from and writing to databases. We have covered the IO patterns of transaction logs, as well as examined a terrific PowerShell script that not only runs all of the IO scenarios that you need to, but parses the output to produce a CSV file for deeper analysis. In short, you now have the tools you need to run a pretty good disk infrastructure stress test and start some interesting conversations with your storage gurus.

However at this point I feel there some pieces missing to this disk puzzle:

  1. We have not yet brought the discussion back to lead and lag indicators. So while we know how to hammer disk infrastructure, how can we be more proactive and specify minimum conditions of satisfaction for our disk infrastructure?
  2. Microsoft treatment of disk performance (and in particular IOPS and latency) in their performance documentation is inconsistent and in my opinion, confuses more than it clarifies. So in the next post, we are going to look at these two issues. In doing so, we are going to leave SQLIO and Performance Monitor behind and examine two other utilities including one that is lesser known, but highly powerful.

Until then, thanks for reading

Paul Culmsee

www.hereticsguidebooks.com

 Digg  Facebook  StumbleUpon  Technorati  Deli.cio.us  Slashdot  Twitter  Sphinn  Mixx  Google  DZone 

No Tags



Jul 16 2012

Demystifying SharePoint Performance Management Part 7 – Getting at Latency, IOPS and MBPS

Hi all, and welcome to part 7 of this series on SharePoint performance planning. This is the point of the series where I realise that I have much more to write about than I intended. Last time this happened I never got around to finishing the series (*blush* … a certain tribute to a humble leave form ). Like that series, I now have no idea how many posts I will end up doing, but I will keep soldiering on nonetheless.

Recapping the last two posts of this series in particular, we have been looking at the relationship between the performance measures of Disk latency, Disk I/O per second (IOPS) and Disk Megabytes transferred per second (MBPS). We spent most of part 6 looking at the relationship between these three performance metrics by specifically focusing on how the size of an IO request affects things. If you recall, a couple of key points were made:

  • In general, the larger the IO request being made, the more latency there will be, resulting in less IOPS but increased MBPS.
  • Latency is significantly affected by whether an IO requests is sequential or random. To demonstrate this, I used a tool called SQLIO to simulate disk IO to generate some performance stats that demonstrated both IOPS and MBPS improved by some 750% when compared to random IO.

We finished the post by examining the way SQL server performs IO requests and what SharePoint components are IOPS heavy. In short, SQL Server uses a range of request sizes for database reads and writes between 8K and 1024KB. The reason for the range (for reads anyhow) is the read-ahead algorithm (gory detail here), in which SQL attempts to proactively retrieve data that are going to be used in the immediate future. A read-ahead may result in a much larger I/O request being made than a single 8KB page, but much better performance because in effect, SQL is pulling more data from each I/O operation.

In this episode (and the next one)…

Our focus in this post and the next one is similar to part 3, in that we are now going to do some real work and some of it will involve the command line. Therefore also like part 3, if you are one of those project manager types who utilise the wussy “I’m business, not technical” excuse, I want you to persist and try this stuff out. Given that I wrote this series with you in mind, put that damn iPad down, get out your laptop and reload this article! You can try all of the steps below out on your PC while you are reading this.

Now for the tech types reading this, on account of my intention to “demystify” SharePoint performance, I will be more verbose that what you guys need. But consider it this way – I am doing you guys a favour because next time your PM or BA’s eyes start to glaze when you start explaining performance and capacity planning to them, you can point them to this series and tell them that there is no excuse.

This article is going to cover two areas. First up let’s look at what we can do with Windows inbuilt Performance Monitor tool in terms of monitoring Latency and IOPS in particular. Next we will look at a popular tool for stress testing disk infrastructure that gives us visibility into MBPS.

The basics: Performance Monitor 101

Just in case you have never done it before, type in PERFMON on any Windows box at the start button or the command line (by the way, I am assuming Windows 7 or Windows 2008 Server here).

image

If you did that, then you are looking at the classic tool used to understand how a PC or server is performing. Looking at the top left of the resultant window, you should see several options listed under “Performance”. Click on “Performance Monitor” and watch the magic. Congratulations… you now know how to measure CPU as that is the default performance counter displayed.

image

Performance monitor can easily be used to take a look at disk IOPS and latency. Right click on the graph and from the menu choose Add Counters… This will provide you with a long list of “performance objects” (a fancy word for a logical grouping of performance counters)

image

From the list of performance objects, scroll up and find “LogicalDisk”. Move your cursor to the arrow to the right of the LogicalDisk counters and click on it. You should see a list of disk related performance counters appear as shown below.

image   image

Note:  You could have chosen the performance object called PhysicalDisk instead of LogicalDisk. The difference between them is that physical disk counters only consider each hard drive, not the way it is partitioned. The Logical Disk performance object monitors logical partitions of a disk. As a general role (for non techy types reading this), go with LogicalDisk.

Right then… now currently, all of the possible performance counters for LogicalDisk are currently selected, but for now we are only interested in latency and IOPS, which are represented by four counters:

Latency: Avg. Disk sec/Read
Avg. Disk sec/Write
Measures the average time, in seconds, of a read of data to the disk. (Therefore 5ms will be shown as 0.005)
Measures the average time, in seconds, of a write of data to the disk

MS Technet Note: Numbers also vary across different storage configurations (SAN cache size/utilization can impact this greatly)
IOPS Disk Reads/sec:
Disk Writes/sec:
The rate of read operations on the disk per second.
The rate of write operations on the disk per second.
MS Technet Note: This number varies based on the size of I/O’s issued. Practical limit of 100-140/sec per disk spindle, however consult with hardware vendor for more accurate estimation.

Go ahead and select these four counters (use the Ctrl key and click each one to select more than one counter). Now we have to choose which disk or partition that we want to monitor. Below where you chose the performance counters, you will see a label with the suitably unclear title of “Instances of selected object” (I have highlighted it below). From here, choose the hard drive or partition you are interested in. Finally, click the “Add” button at the very bottom and you should see your selected counters listed in the “Added counters” window.

image   image

Click the OK button and you should now be seeing these counters doing their thing. Each performance counter you added is listed below the graph showing the performance data collected in real time. The display shows a time period of 100 seconds and is refreshed each second. Also, a neat feature that some people don’t know about it is to click on one of the counters and then hold down Ctrl and type the letter “H”. This is the shortcut key for highlighting the selected counter and the currently selected counter should now be black. Additionally, you should be able to now use the up and down arrow keys to cycle through the counters and highlight each.

image

At this point, try copying some files or open some applications and watch the effect. You should see a spike in disk related activity reflected in the IOPS and latency counters above. There you go business analysts, you officially have monitored disk performance! Wasn’t so hard was it?

Now that we are monitoring some interesting counters, how about we really give the disk something to chew on! Smile

Upping the ante with SQLIO

SQLIO is an old tool nowadays, but still highly relevant and extremely useful. Despite being named SQLIO, it actually has very little to do with SQL Server! It was provided by Microsoft to help determine the I/O capacity that a server can handle. SQLIO allows you to test a combination of I/O sizes for read/write operations, both sequentially and randomly. Thus, it is useful for stress testing the disk infrastructure for any IO intensive application. Now be warned… you can absolutely smash your disk infrastructure with this tool, so don’t go running this in production without some sort of official clearance. Furthermore, if you want to use SQLIO to test your SAN, be sure to consider the other servers and applications that might be using it. There is potential to adversely affect them.

You can download SQLIO from Microsoft here. It will run on any recent Windows OS, so you can try it on your own PC (now you know why I told you to put your iPad away earlier). Installing SQLIO is very simple, just run SQLIO.MSI and it will install by default into C:\Program Files(x86)\SQLIO folder.

Note: If you want a great tutorial on installing and using SQLIO, look no further than MCM Brent Ozar’s 2009 article entitled SQLIO Tutorial: How to Test Disk Performance).

SQLIO works by reading from and writing to one or more test files, so the first thing we need to do with SQLIO is to set up a configuration file that specifies the location and size of these test files. The configuration file is called PARAM.TXT and is found in the installation folder. Each line of the configuration file represents a test file, its size and a couple of other parameters. The options on each line of the param.txt file are as follows:

  • <Path to test file> Full path and name of the test file to be used.
  • <Number of threads (per test file)>
  • <Mask > Set to 0×0
  • <Size of test file in MB> Ideally, this should be large enough so that the test file will be larger than any cache resident on the SAN (or RAID controller).

Of these 4 parameters, only the first one (the location of the file) and last one (the size of the file) matters for now. Below is a sample param.txt that tests a 20GB file on the E:\ Drive.

image

The next step is to run a  quick SQLIO using sequential writes to create the test file. We are going to use the command-line to do this (although someone has written a GUI for the tool). So open a command prompt, change to the installation directory for SQLIO and type the command below (we will save an detailed explanation of the parameters for later).

sqlio -kW -s10 -fsequential -o8 -b8 -LS -Fparam.txt timeout /T 10

This command will create the file and run a 10 second test. The output will look something like what I have pasted below:

sqlio v1.5.SG

using system counter for latency timings, 2241035 counts per second

parameter file used: param.txt

     file e:\testfile.dat with 1 thread (0) using mask 0×0 (0)

1 thread writing for 10 secs to file e:\testfile.dat

     using 8KB sequential IOs

     enabling multiple I/Os per thread with 8 outstanding

size of file e:\testfile.dat needs to be: 20971520000 bytes

current file size:      104857600 bytes

need to expand by:      20866662400 bytes

expanding e:\testfile.dat …

SQLIO will stop here for a while, while your PC chugs away creating the 20GB test file. Once completed, it will run out quick 10 second test, but you can ignore the rest of the output because this test is  of no consequence.

Running a real test

The previous command was just the entre. We are not interested in the resulting data because the point of the exercise was to create the test file. Now it is time for the main course. Try this command. It will spend 2 minutes running a random IO write to the 20gig test file using a size of 8KB for each write.

sqlio -kW -b8 -frandom -s120 -BH -LS -Fparam.txt

Below is the output that summarises the configuration specified by the above command:

sqlio v1.5.SG

using system counter for latency timings, 2241035 counts per second

1 thread writing for 120 secs to file e:\TestFile.dat

using 8KB random IOs

buffering set to use hardware disk cache (but not file cache)

using current size: 20000 MB for file: e:\TestFile.dat

initialization done

For the next two minutes SQLIO will chug away, hammering the disk with writes. Once the test has been performed, SQLIO will report its findings. You will see IOPS, MBPS and a report of average/max/min latency. On top of this, a histogram showing the distribution of latency is provided. This histogram gives context to the “average latency” figure because it shows the shape of the latency that occurred throughout the test. I graphed the distribution in excel below the SQLIO results below:

CUMULATIVE DATA:

throughput metrics:

IOs/sec:   225.80

MBs/sec:     1.76

latency metrics:

Min_Latency(ms): 0

Avg_Latency(ms): 3

Max_Latency(ms): 111

histogram:

ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+

%:  4  6  6 31 23 15  5  3  2  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0

image

Running the numbers…

Now, before we get into a more detailed test, let’s examine some of the SQLIO parameters:

  • -k specifies whether to perform a read or write test (–kW for write and –kR for read)
  • -s specifies how long to run the test for. In the example above it ran for 2 minutes (120 seconds)
  • -f specifies whether to run a random or sequential IO operation (-frandom)
  • -b specifies the size of the IO operations (in the example above 8KB)
  • -t specifies the number of threads to use. A multi-cpu server should be able to utilise more threads than you have processors. If your storage can handle it, we can increase the number of threads and see what latency arises as a result.
  • -o specifies the number of outstanding requests. This simulates a sudden spike in load and gives an indication of how fast IO requests are being serviced. If you keep adding outstanding requests, latency will start to increase as the number of IO requests outstrips the disks ability to service them.
  • -LS means to capture the disk latency information. If you do not specify this you will not get any latency results

Okay, so how about we see what difference a queue of IO requests makes. Below is a SQLIO command with the addition of the –o parameter. Let’s see what a queue of 4 outstanding requests does and compare the historgram output…

sqlio -kW -b8 -frandom –s120 –o4 -BH -LS -Fparam.txt

And the result? Much more latency than our first example above but no real increase in IOPS or MBPS. Clearly we are already at the limit of what my laptop can handle (I stripped the hyperbole and pasted the counters only).

IOs/sec:   221.73

MBs/sec:     1.73

Min_Latency(ms): 0

Avg_Latency(ms): 17

Max_Latency(ms): 187

 

image

Now since I only changed 1 parameter and such a difference was seen, most people will use SQLIO with a batch file to test different parameters. For example, if you were to paste the commands below into a batch file, you would be running write tests using 16KB, 32KB and 64KB sizes.

sqlio -kW -b16 -frandom -s120 -BH -LS -Fparam.txt

sqlio -kW -b64 -frandom -s120 -BH -LS -Fparam.txt

sqlio -kW -b128 -frandom -s120 -BH -LS -Fparam.txt

For what it’s worth, here is the results for each of the above tests (including the 8KB one we stared with) showing the relationship of IOPS, MBPS and latency. As predicted by our exploration of the relationship between request size, IOPS and MBPS in part 6, latency was smallest with the 8KB option.

8KB write 16KB write 64KB write 128KB write
IOs/sec: 225.80

MBs/sec: 1.76

Avg_Latency(ms): 3

IOs/sec: 220.39

MBs/sec: 3.44

Avg_Latency(ms): 4

IOs/sec: 192.85

MBs/sec: 12.05

Avg_Latency(ms): 4

IO/Sec: 176.30

MB/sec: 22.02

Avg Latency(ms): 5

image

Now one quick note: If you want to play with the –t parameter and add more threads, you will  have to reference the test file directly and not refer to the parameters file. This is because the one of the settings in the param.txt file is the number of threads for each file. Not matter what you put in at the command line, it will be overwritten by what is specified in param.txt. Thus the command below would only run a single thread despite 8 threads being specified via the –t parameter.

sqlio -kW -b64 -frandom -s120 -t8 -o1 -BH -LS -Fparam.txt

sqlio v1.5.SG

using system counter for latency timings, 2241035 counts per second

parameter file used: param.txt

file c:\testfile.dat with 1 thread (0) using mask 0×0 (0)

 

To get around this issue, drop the –F parameter and refer to the test file directly as shown below:

sqlio -kW -b64 -frandom -s120 -t8 -o1 -BH -LS e:\testfile.dat

sqlio v1.5.SG

using system counter for latency timings, 2241035 counts per second

8 threads writing for 120 secs to file e:\testfile.dat

 

Conclusion (and coming up next)…

Phew! Okay, so apart from possibly whetting your appetite for smashing disk infrastructure, you might have also come to the realisation that there are many parameters to test in various combinations. In this entire article, I have assumed random writes to the disk, but what about sequential writes? For that matter, what about reads? What about multiple threads and more outstanding requests? What about longer tests or different sized test files?

These are all important questions and I will answer them… in the next post or two. This one is getting a little too long and I have plenty more to cover in this area.

So have a play with the parameters on SQLIO on your own hardware and in the next post, we will continue looking at SQLIO, plus some great work people have done to make your life much easier using it. I want to also return to PERFMON to show you the relationship between the PhysicalDisk and LogicalDisk counters and what SQLIO reports. Then we will examine two other tools, including one that is lesser known, but a very powerful way to measure disk performance. (That one will redeem me with the tech guys who will have no doubt found this article to be too light on :)

Subsequent to that, we hark way back to part 1 and return to a lead indicator point of view of disk IO performance and look at how you can nail the ass off your SAN vendor to ensure they do all the due diligence necessary that your Disk infrastructure will perform well.

Until then, thanks for reading

Paul Culmsee

HGBP_Cover-236x300

www.sevensignma.com.au

 Digg  Facebook  StumbleUpon  Technorati  Deli.cio.us  Slashdot  Twitter  Sphinn  Mixx  Google  DZone 

No Tags



Jun 09 2012

Demystifying SharePoint Performance Management Part 4 – Making use of RPS

Hi all. Welcome to part 4 of a rapidly growing series of posts on trying to take some of the mystery out of SharePoint performance management. Essentially I am trying to write a sort of preamble to the existing Microsoft resources which are extremely comprehensive and full of wisdom, but suffer from being rather large and a lot to get through. To remind you, we have the 307 page “Planning guide for server farms and environments for Microsoft SharePoint Server 2010,” the 367 page “Capacity Planning for Microsoft SharePoint Server 2010” and the lesser known, but equally excellent 23 pages of “Analysing Microsoft SharePoint Products and Technologies Usage” whitepaper.

My hope that this series establishes just enough groundwork for someone to find the aforementioned documents an easier read and get more out of them.

Now this series is starting to turn out like the “Humble Tribute to the Leave Form” series, which I never actually finished (*blush*). Basically the number of posts to complete it exceeded the time I had available to write it (and my interest shifted to other things). For this topic of performance, I originally thought this series might be 4 posts but we are now at post 4 and haven’t actually gotten off the Requests Per Second (RPS) performance counter yet.

So let’s cracking…

Command line alert (again)

I have a tendency to have fun at the expense of IT stereotypes in my posts, and in the interests of fairness, I turned this around in part 3 I and took the piss out of the “I’m business, not technical” wusses instead. You all know who you are… you tend to shun anything that involves the command line as if it was the most complex thing ever. So continuing in that vein, if I managed not to completely scare the crap out of you in the last post, you should have the excellent log parser utility installed and have created a file called LogWithSeconds.csv.  If you have not, go back and read part 3. To remind you quickly, the log parser command that we used to generate the LogWithSeconds.csv file was:

logparser -i:IISW3C file:GetSeconds.txt?startdate=’2011-11-15′+enddate=’2011-11-15′ -o:csv >LogWithSeconds.csv

The key point being that you can specify a date range for the logs you want to process.

For the rest of this article, we continue to play in the command-line playground and utilise some different logparser scripts to derive some useful information. In addition, we will utilise a bit of PowerShell, as well as check out another great free utility written by Nikander & Margriet Bruggeman (more on that later).

Also at this point I need to call out and credit the excellent work of Mike Wise. His aforementioned 2009 whitepaper called“Analysing Microsoft SharePoint Products and Technologies Usage” is the basis for what I cover here. I urge you to download and read this article as it goes into more detail on Log Parser and its uses beyond just RPS alone. Although I have based my stuff off Mike’s work, there are some differences that you will see as we progress through this article.

Distribution of RPS

The one thing that past examination of RPS can give you is a distribution of requests over time. Understanding the distribution (or shape of RPS) helps you to identify patterns to SharePoint use, such as peak or heavy usage times. To that end, the first log parser script will generate a CSV file that can be imported into a tool like excel to chart the distribution of RPS. The log parser script below has been modified from one in Mike’s document, because he assumes you are only looking at 24 hours of log data. In my case, I assume that you might want to profile more than 24 hours (essentially the date range specified in the log parser command above).

The command to generate a per-second RPS distribution is below. The only difference between my script and the one Mike did is I added the “date” field to the SQL to account for multiple days:

logparser -i:CSV -o:CSV “select count(*) as ct,date as Date, secs,max(hh) as hh,max(mi) as mi,max(ss) as ss from e:\temp\LogWithSeconds.csv group by date,secs order by date, secs” -q >RPSDistribution.csv

This command will create a new CSV file called RPSDistribution.CSV that contains the count of requests at any given second during the specified date range. So let’s open RPSDistribution.CSV into Excel and create a chart (I assume you know how to do that). Here is what it looks like…

image

Now I wonder if you can spot the issue with this chart? If you look closely, note that the times are not evenly spaced. This occurs because the generated file (RPSDistribution.CSV) only contains entries for the seconds during the day where there were requests. If no requests were made, then nothing was recorded. This skews the graph because if we want to see the distribution of requests, we also need to know the seconds of the day where there were zero requests. The graph you see above has effectively squeezed out all of the quiet times.

To work around this issue, I wrote the following PowerShell script. For you non-programmers, I am not going to explain all of the gory logic of this script, but just be assured that it adds entries stating zero RPS for every second of the day where there were no requests made. This will normalise the data across time and make a much more meaningful graph.

(If this is starting to hurt your brain, stick with me… paste the code below into notepad, save it in the Log parser installation folder and call it AddNulsPerSec.ps1)

param([string]$inputcsv, [string]$outputcsv = "output.csv")
if (!$inputcsv) {
    write-host "The -inputcsv parameter has not been specified. Script cannot run without it";
    exit;
}
if (Test-Path -path $outputcsv) { remove-item $outputcsv }
$x = 0;
$y = import-csv $inputcsv
write-output "ct,date,secs,minu,hh,mm,ss" | add-content -path $outputcsv
$y | foreach {
    if ($x -gt 86399) { $x = 0 }
    $s = [int]$_.secs;
    while ($s -gt $x) {
        $d = [datetime]$_.date;
        $d=$d.AddSeconds($x)
        $ss = $d.tostring("ss")
        $mm = $d.tostring("mm")
        $hh = $d.tostring("HH")
        $minu = [int]$hh * 60 + [int]$mm
        $output = "0" + "," + $_.Date + "," + $x + "," + $minu + "," + $ss + "," + $mm + "," + $hh
        write-output $output
        $x++;
   }
   $output = $_.ct + "," + $_.Date + "," + $_.secs + "," + $_.minu + "," + $_.ss + "," + $_.mi + "," + $_.hh
   write-output $output
   $x++;
} | add-content -path $outputcsv

The above script takes two command-line parameters: inputCSV and outputCSV. inputCSV is the file name to process and outputCSV is the resulting file with the 0 RPS entries added. Note that to run this script you will need to use a PowerShell window, rather than a command prompt. Below is the command I used:

PS C:\Program Files (x86)\Log Parser 2.2> .\AddNulsPerSec.ps1 -inputcsv RPSDistribution.CSV -outputcsv RPSDistributionNormalised.CSV

This created the file RPSDistributionNormalised.CSV. I charted this file in Excel and we now have a time-normalised distribution. Take a look at the X axis. This looks more logical now as the times are more evenly spaced. It seems from looking at this, that peak times are between 10am-11am, although one could argue that a lot of the day was fairly busy, with a bit of a lull between 2 and 3pm.

image

So what else can we do?

Right, so apart from the utility of being able to get a sense of when there are a lot of requests versus quiet times, can we find out anything else useful? Much insight can be gleaned from Mike Wise’s document, so here I will cover a couple of things specific to RPS.

RPS distribution for certain users

Let’s go back to the LogWithSeconds.CSV we started with and find out the top users for the period being examined. In the log parser command below we are grouping users by total requests they made, ordering from largest to smallest..

logparser -i:csv “select top 20 count(*) as ct,cs-username as user from LogWithSeconds.csv group by user order by ct desc”

A snippett of the output from this command is below:

ct  user
--- --------------------
840 DOMAIN\Jame.Smith
688 DOMAIN\searchcrawler
614 DOMAIN\Ian.Jones
508 DOMAIN\Steve.Hill
357 DOMAIN\Ant.Cough
313 DOMAIN\dom.davies
260 DOMAIN\matthew.martin

Hmm, I notice that the search crawler account (DOMAIN\searchcrawler) was busy during that day. It appears to have made the second largest number of requests. How about we work out when the search crawler was active by filtering the requests just for this user. Perhaps search crawls are active during peak times and introducing unnecessary load on the server?

First up, lets create the RPS distribution, but this time just for the search crawler account (note the SQL WHERE clause in the command below)

logparser -i:CSV -o:CSV “select count(*) as ct,date as Date, secs,max(hh) as hh,max(mi) as mi,max(ss) as ss,cs-username as user from LogWithSeconds.csv where user=’DOMAIN\searchcrawler’  group by user, date,secs order by date, secs” -q>crawler.csv

Now we need to pad CRAWLER.CSV out with 0 entries to time-normalise it for the seconds in which it wasn’t active…  back to my PowerShell script…

PS C:\Program Files (x86)\Log Parser 2.2> .\AddNulsPerSec.ps1 -inputcsv crawler.csv -outputcsv CrawlerNormalised.csv

I then took the results from CrawlerNormalised.csv and added them to my previous RPS distribution chart in Excel. Straight away you can see the incremental crawl schedule of this SharePoint installation is 5 hourly. (Note the red lines at regular intervals)

image

RPS Distribution for certain clients…

Another use for RPS is to see the pattern of the various applications that interact with SharePoint. Aside from the trusty old browser, we also have Office clients, Windows Explorer, SharePoint Workspace, and 3rd party tools like SharePlus. All of these applications identify themselves to SharePoint via the use of something called the user-agent [stored in the LogWithSeconds.CSV file in a column called cs(user-agent)]. The user agent field is actually part of the HTTP standard and not SharePoint specific, but let’s take advantage of it…

logparser -i:CSV “select count(*) as ct,cs(user-agent) from LogWithSeconds.CSV group by cs(user-agent) order by ct desc” -q >BrowserList.csv

Now, I am not going to paste the complete output of running this command because unfortunately, browsers have a lot of variation in their user agent string. Nevertheless, here are some of results from the BrowserList.csv file…

867 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+5.1;+Trident/4.0;+.NET+CLR+1.1.4322;+.NET+CLR+2.0.50727;+.NET+CLR+3.0.04506.30;+.NET+CLR+3.0.04506.648;+.NET+CLR+3.0.4506.2152;+.NET+CLR+3.5.30729;+yie8)
688 Mozilla/4.0+(compatible;+MSIE+4.01;+Windows+NT;+MS+Search+6.0+Robot)
386 harmon.ie+for+Notes
333 Mozilla/5.0+(Windows;+U;+Windows+NT+6.1;+en-US;+rv:1.9.2.24)+Gecko/20111103+Firefox/3.6.24
250 SharePlus+2.9.5+(iPad;+iPhone+OS+4.2.1;+en_AU)
99  MSFrontPage/12.0
95  Mainsoft+SharePoint+Integrator
68  Microsoft+Office/14.0+(Windows+NT+6.1;+OWSSUPP+14.0.6112;+Pro)
34  Microsoft-WebDAV-MiniRedir/6.1.7600
24  Microsoft+Office+Outlook+2010+(14.0.6112)+Windows+NT+6.1
23  Microsoft+Office+Sharepoint+Workspace+2010+(14.0.6112)+Windows+NT+6.1
4   MobileSafari/7534.48.3+CFNetwork/548.0.3+Darwin/11.0.0

Now looking at these strings, it doesn’t take long to get a sense of the different ways SharePoint has been accessed. How about we generate a distribution of RPS for all iPad devices or apps? Here’s the log parser command along with my time normaliser script.

logparser -i:CSV -o:CSV “select count(*) as ct,date as Date, secs,max(hh) as hh,max(mi) as mi,max(ss) as ss,cs(User-Agent) as ua from LogWithSeconds.CSV where ua like ‘%iPad%’ group by ua, date,secs order by date, secs”-q >iPads.csv

.\AddNulsPerSec.ps1 -inputcsv E:\temp\ipads.csv -outputcsv e:\temp\iPadNormalised.csv

… and the result when added into Excel? iPads were definitely the evening tool of choice that day! Note the green spikes around 9-10pm.

image

Taking it further…

I won’t do much more on RPS now. Hopefully I have given you enough to do much more clever things than I have covered. As I stated in part 2, the nice thing about RPS is that it can be derived from web server logs and these tend to go back quite far in time. Given that the core sequence of events to produce the graphs above are essentially 3 scripts and can be done quickly, it becomes quite easy to sample different points in history. For example: Let’s say that we want to compare the 15th of November 2011 with the 10th of March 2012 to see whether there is an increase/decrease in requests and what this looks like. All we have to do is change the date, re-run the scripts and do some charting magic.

  • logparser -i:IISW3C file:GetSeconds.txt?startdate=’2012-3-10′+enddate=’2012-3-10′ -o:csv >LogWithSeconds.csv
  • logparser -i:CSV -o:CSV “select count(*) as ct,date as Date, secs,max(hh) as hh,max(mi) as mi,max(ss) as ss from LogWithSeconds.csv group by date,secs order by date, secs” -q >RPSDistribution.csv
  • AddNulsPerSec.ps1 -inputcsv RPSDistribution.CSV -outputcsv RPSDistributionNormalised.CSV

We can look at the distribution for an entire week is we wanted to…

  • logparser -i:IISW3C file:GetSeconds.txt?startdate=’2012-5-7′+enddate=’2011-5-11′ -o:csv >LogWithSeconds.csv
  • logparser -i:CSV -o:CSV “select count(*) as ct,date as Date, secs,max(hh) as hh,max(mi) as mi,max(ss) as ss from LogWithSeconds.csv group by date,secs order by date, secs” -q >RPSDistribution.csv
  • AddNulsPerSec.ps1 -inputcsv RPSDistribution.CSV -outputcsv RPSDistributionNormalised.CSV

Also in case you didn’t notice in part 3, my GetSeconds.txt logparser script that performs the initial processing of the IIS logfiles, also stores the minute of the day, as well as seconds of the day. This allows you to perform all of the same things I have specified in this article except it can be Requests Per Minute (RPM) instead. This would allow you to work with a larger date range without such big files (provided RPM is appropriate for what you want). Consult the “Analysing Microsoft SharePoint Products and Technologies Usage” whitepaper for more information on logparser queries for RPM scenarios.

Remember that the virtues of your web server logs go much further than RPS. We saw a hint of this in my examples of showing RPS for just one user or one device type. But this really is just scratching the surface of what can be gleaned via logparser. There are many excellent logparser scripts around, and a quick google search should give you plenty of examples.

Also remember that there are many more sophisticated ways to process this sort of data. For example, putting it into Analysis Services and slicing/dicing it via PowerPivot, or using something like RRDTool. To that end, I would also be seriously remiss if I did not make you aware of the SharePoint Flavored Weblog Reader tool. It was created by Nikander & Margriet Bruggeman who run the SharePoint Dragons blog – probably the best SharePoint performance related blog out there. This tool was specifically designed to make it easier to analyse IIS logs for SharePoint specific information. It is a command line tool, but much simpler and slicker than the methods I introduced in this post. Instead of specifying a date range you specify the number of items from the logs to process. For example:

sfwr.exe 250,000 “E:\LOGS\IIS_WWW\W3SVC1045333159”

Here are some of the things it reports on for you straight out of the box:

  • Most busy days of the week, most requested pages and requested pages per day.
  • The average, max and min request times per URI, InfoPath URI and Report Server URI.
  • Browser percentages, dead links, failed pages, percentage of error page requests.
  • Requests per hour per day, requests per hour, requests per user (also per week and month)
  • Slowest requests, top requests per hour, top visitors.
  • Traffic per day and per week in MB.
  • Unique visitors per day, week and month

Reflections…

At the end of the day, while examining the pattern of RPS can be very handy in offering insights into how your web application (SharePoint or otherwise) performs, as a lead indicator it is always going to be fairly wishy washy. As soon as you turn your attention to the future, many variables come into play that you cannot predict as accurately as you’d like. Your existing webserver logs can offer you a lot of ways to help make a more informed prediction, but at the end of the day, one has to take into account the new unique features of SharePoint 2010, how they will be used and so forth.

I will be returning to this theme, once we examine some other performance indicators, but hopefully at this point, you might find some aspects of Microsoft’s Capacity planning for Microsoft SharePoint Server 2010 guide less intimidating. Page 43 in particular has some great material that builds upon what we cover here. To quote Microsoft…

Understanding the distribution of the requests based on the clients applications that are interacting with the farm can help predict the expected trend and load changes after migrating to SharePoint Server 2010. As users transition to more recent client versions such as Office 2010, and start using the new capabilities new load patterns, RPS and total requests are expected to grow

I will leave you with a terrific example a graph that Microsoft created using IIS logs (on page 44 of the aforementioned document). This is a view of a typical day in an internal Microsoft environment serving what is deemed “a typical social solution”. It shows just how much additional load a new feature can introduce (in this case, Outlook Social Connector feature is 6.2% of the total number requests). The combination of different clients on the X axis, with the % distribution of overall and per-user requests on the Y axis is really handy.

image

Coming up next…

At this point I think we have covered RPS sufficiently, and dovetailed in nicely to Microsoft documentation – particularly pages 41-47 of the SharePoint 2010 Capacity Planning Guide. Our next stop will be looking at another much misunderstood lead indicator for performance – disk IO and latency. Once again I will introduce you to a couple of useful tools and offer you what I think is the best way to approach disk performance requirements that will make life easier for you and your storage people.

Until then, thanks for reading…

Paul Culmsee

HGBP_Cover-236x300

 Digg  Facebook  StumbleUpon  Technorati  Deli.cio.us  Slashdot  Twitter  Sphinn  Mixx  Google  DZone 

No Tags



Jun 06 2012

Demystifying SharePoint Performance Management Part 3 – Getting at RPS

Hi and welcome back to this series aimed at making SharePoint performance management a little more digestible. In the first post we examined the difference between lead and lag indicators and in the second post, we specifically looked at the lead indicator of Requests Per Seconds (RPS) and its various opportunities and issues. In this episode we are actually going to do some real work at the – wait for it – the command line! As a result the collective heart rates of my business oriented readers – who are avid users of the “I’m business, not technical” cliché  – will start to rise since anything that involves a command line is shrouded in mystique, fear, uncertainty and doubt.

For the tech types reading this article, please excuse the verboseness of what I go through here. I need to keep the business types from freaking out.

Okay… so in the last post I said that despite its issues in terms of being a reliable indicator of future performance needs, RPS has the advantage that it can be derived from your existing deployment. This is because the information needed is captured in web server (IIS) logs over time. Having this past performance means you have a lag indicator view of RPS, which can be used as a basis to understand what the future might look like with more confidence than some arbitrary “must handle x RPS.”

Now just because RPS is held inside web server log files, does not mean it is easy to get to. In this post, I will outline the 3 steps needed to manipulate logfiles to extract that precious RPS goodness. The utility that we are going to use to do this is Log parser.

Now a warning here: This post assumes your existing deployment runs on Microsoft’s IIS platform v7 (the webserver platform that underpins SharePoint 2010). If you are running one of the myriad of portal/intranet platforms, you are going to have to take this post as a guide and adjust to your circumstances.

Step 1: Getting Log parser

Installing Log parser is easy. Just install version 2.2 as you would any other tool. It will run on pretty much any Windows operating system. Once installed, it will likely reside in the C:\Program Files (x86)\Log Parser 2.2 folder. (Or C:\Program Files\Log Parser 2.2 if you have an older, 32 bit PC).

There you go business types – that wasn’t so hard was it?

Step 2: Getting your web server logs

After the relative ease of getting log parser installed, we now need the logs themselves to play with. We are certainly not going to mess with a production system so we will need to copy the log files for your current portal to the PC where you installed Log parser. If you do not have access to these log files, call your friendly neighbourhood systems administrator to get them for you. If you have access (or do not have a friendly neighbourhood systems administrator), then you will need to locate the files you need. Here’s how:

Assuming you have access to your web front end server/s, you can load Internet Information Services (IIS) Manager from Start->All programs->Administrative tools on the server. Using this tool we can find out the location of the IIS log files as well as the specific logs we need. By default IIS logs are stored in C:\inetpub\logs\LogFiles, but it is common for this location to be changed to somewhere else. To confirm this in IIS manager, click on the server name in the left pane, then click on the Logging” icon in the right pane. In the example below, we can see that the IIS logfiles live in G:\LOGS\IIS folder (I always move the logfiles off C:\ as a matter of principle). While you are there, pay special attention to the fairly nondescript “Use local time for file naming and rollover” tickbox. We are going to return to that later…

image  image

Okay so we know where the log files live, so lets work out the sub-folder for the specific site. Back in the left hand pane now, expand “Sites” and find the web site you want to profile for RPS. When you have found it, select it and find the “Advanced Settings” link and click it.

image_thumb5

On the next screen you will see ID of the site. It will be a large number – something like 1045333159. Take a note of this ID, because all IIS logs for this site will be stored in a folder with the name “W3SVC” prepended to this ID (eg W3SVC1045333159). Thus the folder we are looking for is G:\LOGS\IIS\W3SVC1045333159. Copy the contents of this folder to the computer where you have installed logparser to. (In my example below I copied the logs to E:\LOGS\IIS_WWW\W3SVC1045333159 on a test server).

image10_thumb2 image14_thumb

Step 3: Preparation of log files…

Okay, so now we have our log files copied to our PC, so we can start doing some log parser magic. Unfortunately default IIS logfile format does not make RPS reporting particularly easy and we have to process the raw logs to make a file that is easier to use. Now business people – stay with me here… the payoff is worth the command line pain you are about to endure! Smile

First up, we will make use of the excellent work of Mike Wise (You can find his original document here), who created a script for log parser that processes all of the logfiles and creates a single (potentially very large) file that:

  • includes a new field which is the time of the day converted into seconds
  • splits the date and timestamp up into individual bits (day, month, hour, minute, etc.) This makes it easier to do consolidated reports
  • excludes 401 authentication requests (way back in part 1 I noted that Microsoft excludes authentication traffic from RPS)

I have pasted a modified version of Mike’s log parser script below, but before you go and copy it into Notepad, make sure you check two really important things.

  1. Be sure to change the path in the second last line of the script to the folder where you copied the IIS logs to (In my case it was E:\LOGS\IIS_WWW\W3SVC1045333159\*.log)
  2. Check whether IIS is saving your logfiles using UTC timestamps or local timestamps. (Now you know why I told you to specifically make note of the “Use local time for file naming and rollover” tickbox earlier). If the box is unticked, the logs are in UTC time and you should use the first script pasted below. If it is ticked, the logs are in local time the second script should be used.

UTC Script

select EXTRACT_FILENAME(LogFilename),LogRow, date, time, cs-method, cs-uri-stem, cs-username,
c-ip, cs(User-Agent), cs-host, sc-status, sc-substatus, sc-bytes, cs-bytes, time-taken,

add(
    add(
         mul(3600,to_int(to_string(to_localtime(to_timestamp(date,time)),'hh'))),
         mul(60,to_int(to_string(to_localtime(to_timestamp(date,time)),'mm')))
    ),
    to_int(to_string(to_localtime(to_timestamp(date,time)),'ss'))
) as secs,

add(
    mul(60,to_int(to_string(to_localtime(to_timestamp(date,time)),'hh'))),
    to_int(to_string(to_localtime(to_timestamp(date,time)),'mm'))
) as minu,

to_int(to_string(to_localtime(to_timestamp(date,time)),'yy')) as yy,
to_int(to_string(to_localtime(to_timestamp(date,time)),'MM')) as mo,
to_int(to_string(to_localtime(to_timestamp(date,time)),'dd')) as dd,
to_int(to_string(to_localtime(to_timestamp(date,time)),'hh')) as hh,
to_int(to_string(to_localtime(to_timestamp(date,time)),'mm')) as mi,
to_int(to_string(to_localtime(to_timestamp(date,time)),'ss')) as ss,
to_lowercase(EXTRACT_PATH(cs-uri-stem)) as fpath,
to_lowercase(EXTRACT_FILENAME(cs-uri-stem)) as fname,
to_lowercase(EXTRACT_EXTENSION(cs-uri-stem)) as fext

from e:\logs\iis_www\W3SVC1045333159\*.log

where sc-status<>401 and date BETWEEN TO_TIMESTAMP(%startdate%, 'yyyy-MM-dd') and TO_TIMESTAMP(%enddate%, 'yyyy-MM-dd')

Local Time Script

select EXTRACT_FILENAME(LogFilename),LogRow, date, time, cs-method, cs-uri-stem, cs-username,
c-ip, cs(User-Agent), cs-host, sc-status, sc-substatus, sc-bytes, cs-bytes, time-taken,

add(
   add(
      mul(3600,to_int(to_string(to_timestamp(date,time),'hh'))),
      mul(60,to_int(to_string(to_timestamp(date,time),'mm')))
   ),
   to_int(to_string(to_timestamp(date,time),'ss'))
) as secs,

add(
   mul(60,to_int(to_string(to_timestamp(date,time),'hh'))),
   to_int(to_string(to_timestamp(date,time),'mm'))
) as minu,

to_int(to_string(to_timestamp(date,time),'yy')) as yy,
to_int(to_string(to_timestamp(date,time),'MM')) as mo,
to_int(to_string(to_timestamp(date,time),'dd')) as dd,
to_int(to_string(to_timestamp(date,time),'hh')) as hh,
to_int(to_string(to_timestamp(date,time),'mm')) as mi,
to_int(to_string(to_timestamp(date,time),'ss')) as ss,
to_lowercase(EXTRACT_PATH(cs-uri-stem)) as fpath,
to_lowercase(EXTRACT_FILENAME(cs-uri-stem)) as fname,
to_lowercase(EXTRACT_EXTENSION(cs-uri-stem)) as fext

from e:\logs\iis_www\W3SVC1045333159\*.log

where sc-status<>401 and date BETWEEN TO_TIMESTAMP(%startdate%, 'yyyy-MM-dd') and TO_TIMESTAMP(%enddate%, 'yyyy-MM-dd')

After choosing the appropriate script and modifying the second last line, save this file into the Log parser installation folder and call it GETSECONDS.TXT.

For the three readers who *really* want to know, the key part of what this does is to take the timestamp of each log entry and turn it into what second of the day it is and what minute of the day it is. So assuming the timestamp is 8:35am at the 34 second park, the formula effectively adds together:

  • 8 * 3600 (since there are 3600 seconds in an hour)
  • 35 * 60 (60 seconds in a minute)
  • 34 seconds

= 30934 seconds

  • 8 * 60 (60 minutes in an hour)
  • 35 minutes

= 515 minutes

Now that we have our GETSECONDS.TXT script ready, let’s use Log parser to generate our file that we will use for reporting. Open a command prompt (for later versions of windows make sure it is an administrator command prompt) and change directory to the LogParser installation location.

C:\Program Files (x86)\Log Parser 2.2>

Now decide a date to report on. In my example, the logs go back two years and I only want the the 15th of November 2011. The format for the dates MUST be “yyyy-mm-dd” (e.g. 2011-11-15).

Type in the following command (substituting whatever date range interests you):

logparser -i:IISW3C file:GetSeconds.txt?startdate=’2011-11-15′+enddate=’2011-11-15′ -o:csv >e:\temp\LogWithSeconds.csv

  • The –i parameter specifies the type of input file. In this case the input file is IISW3C (IIS weblog format)
  • The ?startdate parameter specifies the start  date you want to process
  • The +enddate parameter specifies the end date you want to process
  • The –o parameter specifies the type of output file. In this case the output file is CSV format
  • The –q parameter says not to prompt the user for anything
  • The >LogWithSeconds.csv says to save the CSV output into a file called LogsWithSeconds.csv

So depending on how many logfiles you had in your logs folder, things may take a while to process. Be patient here… after all, it might be processing years of logfiles (and now you know why we didn’t do this in a production install!). Also be warned, the resulting LogWithSeconds.csv that is created will be very very big if you specified a wide date range. Whatever you do, do not open this file with notepad if its large! We will be using additional log parser scripts to interrogate it instead.

Conclusion

Right! If you got this far and your normally not a command line kind of person… well done! If you are a developer, thanks for sticking with me. You should have a newly minted file called LogWithSeconds.csv and you are ready to do some interrogation of it. In the next post, I will outline some more logparser scripts that generate some useful information!

Until then, thanks for reading

Paul Culmsee

p.s Why not check out my completely non SharePoint book entitled “The Heretics Guide to Best Practices”. It recently won a business book award.

 Digg  Facebook  StumbleUpon  Technorati  Deli.cio.us  Slashdot  Twitter  Sphinn  Mixx  Google  DZone 

No Tags



Jun 02 2012

Demystifying SharePoint Performance Management Part 2 – So what is RPS anyway?

Hi all

I never mentioned it in the first post that the reason I am blogging again is I finally completed most of the game Skyrim. Man – that game is dangerous if you value your time!

Anyway, in the first post, I introduced this series by covering the difference between lead and lag performance indicators. To recap from part 1, a lead indicator is something that can help predict an outcome by measuring an action, whereas a lag indicator measures the result or outcome achieved from taking an action. This distinction is important to understand, because otherwise it is easy to use performance measurements inappropriately or get very confused. Lead indicators in particular sometimes feel wishy washy because it is hard to have a direct correlation to what you are seeing.

In this post, we are going to examine one of the most commonly cited (and abused) lead indicators to measure for performance. Good old Requests Per Second (RPS). Let’s attempt to make this more clear…

Microsoft defines RPS as:

The number of requests received by a farm or server in one second. This is a common measurement of server and farm load. The number of requests processed by a farm is greater than the number of page loads and end-user interactions. This is because each page contains several components, each of which creates one or more requests when the page is loaded. Some requests are lighter than other requests with regard to transaction costs. In our lab tests and case study documents, we remove 401 requests and responses (authentication handshakes) from the requests that were used to calculate RPS because they have insignificant impact on farm resources

So according to this definition, RPS is any interaction between browsers (or any other device or service making web requests) and the SharePoint webserver, excluding authentication traffic. The logic of measuring requests per second is that it provides insight into how much load your SharePoint box can take because, after all, SharePoint at the end of the day is servicing requests from users.

RPS by example

Before we start picking apart RPS and its issues, let’s look at an example. Assuming you are viewing this page in Internet Explorer version 8 or 9, press F12 right now. You should see something like the screen below. If you have not seen it before, it is called the internet explorer developer tools and is bloody handy. Now click on the “Network” link, highlighted below and then click the “Start capturing” button.

image

Now refresh this page and watch the result. You should see a bunch of activity logged, looking something like the picture below.

image

What you are looking at is all of the requests that your browser has made to load this very page. While the detail is not overly important for the purpose of this post, the key point is that to load this page, many requests were made. In fact if you look in the left-bottom corner of the above screenshot, a total of 130 individual requests are listed.

So, first pop-quiz for the day: Were all 130 requests made to my cleverworkarounds blog to refresh this page? The answer my friends is no. In actual fact, only 2 items were loaded from my blog!

So why the discrepancy? What happened to the other 128 requests? Two main reasons.

1. Browser cache: First up, many of the items listed above were cached by my browser already. I’ve been to this site before, and so a lot of the page components (CSS style sheets, logos and the like) did not have to be retrieved again. It just happens that the internet explorer developer tool shows requests that were handled by locally cached data as well as actual requests made to SharePoint. If you look closely at the “Result” column in the above screenshot, you will see that some entries are grey colour while others are black. All of the grey entries are cached requests. They never left the confines of the browser. This alone accounts for 95 of the 130 requests.

Now this is worth consideration because if a browser has never accessed this site before, there will be no content in the browser cache. Therefore, on first access, the browser would indeed have made 95 additional requests to load the page. This scenario is most likely on day one of a production SharePoint rollout, where a large chunk of the workforce might load the homepage for the first time.

2. Content from other sites: The second reason for the discrepancy is that some content doesn’t even come from the cleverworkarounds site. Anytime you visit a blog and it has a snazzy widget like Amazon books or Facebook “like” buttons, that content is very likely being retrieved from Amazon or Facebook. In the case of this very article you are reading, 33 requests were made to other sites like Facebook, amazon, feedburner, sharepointads and whoever else happens to grace a widget on the right hand side. In these cases, my server is not handling this traffic at all. This accounts for 33 of the 130 requests.

95 + 33 = 128 of the 130 requests made.

So hopefully now you get what is meant by RPS. Let’s now look at its utility in measuring performance.

Dangers of RPS reliance…

Consider two fairly typical SharePoint transactions: The first example is loading the SharePoint home page and the second example is where a user loads a document from a SharePoint document library. Below I have compared the two transactions by using an Office 365 site of mine and capturing the requests made by each one. (For what its worth, I used a utility called fiddler rather than the developer toolbar because it has some snazzier features).

In example #1, we have loaded the homepage of an Office365 site (assuming for the first time). In all, 36 requests made to the server. If we add up the amount of data returned by the server (summing the “Body” column below), we have a total of 245,322 bytes received.

image

In request #2, we are looking at the trace of me opening a 7 megabyte document from a document library. Notice that this time, 17 requests were. But compared to the first example, significantly more data was returned from the server: 7,245,876 KB in fact. If you drill down further by examining the “Body” column, you will notice that of those 17 requests, 3 of them were the bulk of the data transferred with 3,149,348, 3,148,008 and 891,069 KB respectively.

image

So here is my point. Some requests are more significant than others! In the latter example, 3 of the 17 requests transferred 98% of the data. The second transaction also took much longer than the first, and the data was retrieved from the SQL Server database, which meant that this interaction with SharePoint likely had more back-end performance load than the first example when the home page was loaded. When loading the home page, the data may have been served from one of the many SharePoint caches and barely touching the back-end SQL box.

Now with that in mind, consider this: The typical rationale you see around the interweb for utilising RPS as a performance tool is to estimate future scalability requirements. Statements like "This SharePoint farm needs to be capable of 125RPS” are fairly common. Traditionally, the figure was derived from a methodology that looked something like:

  1. Work out the peak times of the day for SharePoint site usage (for example between 10:45am-2:45pm each day)
  2. Estimate the number of concurrent users accessing your SharePoint site during this time
  3. Classify the users via their usage profile (wussy, light, heavy, psycho, etc)
  4. Estimate how many transactions each of these user types might make in the peak hour (a transaction being an operation like browse home page, edit document, and so on)
  5. Multiply concurrent users by the number of expected transactions to derive the total number of transactions for the period
  6. Divide the total by the number of seconds in the period to work out how many transactions per second.

There are lots of issues with this methodology, but here are 4 obvious ones.

  1. The first is that it confuses transactions with requests. While browsing the SharePoint home page might be considered one “transaction”, it will likely consist of more than one request (particularly if the content being served is designed to be fairly dynamic and not rely on cache data). Essentially this methodology may underestimate the number of requests because it assumes a 1:1 relationship between a transaction and a request. My two examples above demonstrate that this is not the case.
  2. The classification of usage profile of users (light, medium, heavy) is crude and overlooks the aforementioned variation in usage patterns. A “heavy user” might continually update a SharePoint calendar, while a “light” user might load 20 megabyte documents or run sophisticated reports. In both cases, the real load on the infrastructure – and the resulting response time – may be quite varied.
  3. It fails to take into consideration the fact that SharePoint 2010 in particular has many new features in the form of Service Applications. These also make requests behind the scenes that have load implications. The most obvious example is the search crawling SharePoint sites.
  4. It also overlooks the fact that SharePoint content is often accessed indirectly. Many non-browser client tools such as SharePoint Workspace, OneNote, Outlook Social Connector, Harmon.ie and the like. If Colligo Contributor is deployed to all desktops, does that make all users “heavy?”

So hopefully by now, you can understand the folly of saying to someone “This system should be capable of handling 150RPS.” There is simply far too many variables that contribute to this, and each request can be wildly different in terms of real load on the back-end servers. Now you know why Robert Bogue likened this issue to Drakes Equation in part 1. The RPS target arrived at utilising this sort of methodology is likely to be fairly inaccurate and of questionable value.

So what is RPS good for and how do I get it?

So am I anti RPS? Definitely not!

The one thing RPS has going for it, which makes it incredibly useful, is that it is likely to be the one performance metric that any organisation can tap into straight away (assuming you have an existing deployment). This is because the metric is collected in web server (IIS) logs over time. Each request made to the server is logged with a date and timestamp. For most places, this is the only high fidelity performance data you have access to, because many organisations do not collect and store other stats like CPU and Disk IO performance over time. While its unlikely you would be able to see CPU for a server 6 months ago on Tuesday at 9:53am, chances are you can work out the RPS at that time if you have an existing intranet or portal. The reason for this is that IIS logs are not cleared so you have the opportunity to go back in time and see how a SharePoint site has been utilised.

The benefit is that we have the means to understand past performance patterns of an organisations use of their intranet or portal. We can work out stuff like:

  • peak times of the day for usage of the portal based on previous history
  • the maximum number of requests that the server has ever had to process
  • the rate of increase/decrease of RPS over time (ie “What was peak RPS 6 months ago?  What was it 3 months ago?)
  • the patterns/distribution of requests over a typical day (peaks and troughs – we can see the “shape” of SharePoint usage over a given period)

As an added bonus, the data in web server logs allow for some other fringe benefits including stuff like:

  • the percentage or pattern of requests were “non interactive” (such as % of requests that are search crawls or SharePoint workspace sync’s)
  • identifying usage patterns of certain users (eg top 10 users and their usage usage patterns)

Finally, if you monitor CPU and disk performance, you can compare the RPS peaks against those other performance counters and then interpolate how things might have been in the past (although this has some caveats too).

Coming up next…

Okay so now you are convinced that RPS does not suck – and you want to get your hands on all this RPS goodness. The good news is that its fairly easy to do and Microsoft’s Mike Wise has documented the definitive way to do it. The bad news is, you have to download and learn a yet another utility. Fear not though as the utility (called LogParser) is brilliant and needs to be in your arsenal anyway (especially business oriented SharePoint readers of this blog – this is not one just for the techies). Put simply, LogParser provides the ability to do SQL-like queries to your log files. You can have it open a log file (or series of files), process them via a SQL style language, and then output the results of your query into different formats for reporting.

But, just as I have whetted your appetite, I am going to stop.This post is already getting large and I still have a bit to get through in relation to using LogParser, so I will focus on that in the next post.

Hopefully though at this point, you don’t totally hate RPS, have a much better idea of what RPS is and some of the issues of its use.

Thanks for reading

Paul Culmsee

www.hereticsguidebooks.com

 Digg  Facebook  StumbleUpon  Technorati  Deli.cio.us  Slashdot  Twitter  Sphinn  Mixx  Google  DZone 

No Tags



Jan 30 2012

On the decay (or remarkable recurrence) of knowledge

“That’s only 10%…”

One of my mentors who is mentioned in the book I wrote with Kailash (Darryl) is a veteran project manager in the construction and engineering industry. He has been working as a project manager more than 30 years, is a fellow of the Institute of Engineers and marks exams at the local university for those studying a Masters Degree in Project Management. His depth of knowledge and experience is abundantly clear when you start working with him and I have learned more about collaborative project delivery from him than anyone else.

Recently I was talking with him and he said something really interesting. He was telling some stories from the early days of alliancing based project delivery in Australia (alliancing is a highly interesting collaborative project governance approach that we devote a chapter to in our book). He stated that alliancing at its core is the application of good project management practice. Now I know Darryl pretty well and knew what he meant by that, but commented to him that when you say the word “project management practice,” some would associate that statement with (among other things) a well-developed Gantt chart listing activities with names, tasks and times.

His reply was unsurprising: “at best that’s only 1/10th of what project management is really about.”

Clearly Darryl has a much deeper and holistic view of what project management is than many other practitioners I’ve worked with. Darryl argues that those who criticise project management are actually criticising a small subset of the discipline, based on their less than complete view of what the discipline entails. Thus by definition, the remedies they propose are misinformed or solve a problem that has already been solved.

Whether you agree with Darryl or not, there is a pattern here that occurs continually in organisation-land. Fanboys of a particular methodology, framework model or practice (me included) will waste no time dumping on whatever they have grown to dislike and swear that their “new approach” addresses the gaps. Those with a more holistic view like Darryl might argue that crusaders aren’t really inventing anything new and that if a gap exists, it’s a gap in the knowledge of those doing the criticising.

As Ambrose Bierce said, “There is nothing new under the sun but there are lots of things we don’t know.”

From project management to systems thinking…

Now with that in mind, here’s a little anecdote. A few weeks back I joined a Design Thinking group on LinkedIn. I had read about Design Thinking during its hype phase a couple of years ago and my immediate thought was “Isn’t this just systems thinking reinvented?” You see, I more or less identify myself as a bit of a pragmatic systems thinker, in that I like to broaden a discussion, but I also actually get shit done. So I was curious to understand how design thinkers see themselves as different from systems thinkers.

I followed several threads on the LinkedIn group as the question had been discussed a few times. Unfortunately, no-one could really put their finger on the difference. Eventually I found a recent paper by Pourdehnad, Wexler and Wilson which went into some detail on the two disciplines and offered some distinctions. I won’t bother you with the content, except to say it was a good read, and left me with the following choices about my understanding of systems and design thinking:

  • That my understanding of systems thinking is wrong and I am in fact a design thinker after all
  • That I am indeed a systems thinker and design thinking is just systems thinking with a pragmatic bent

Of course being a biased human, I naturally believe the latter point is more correct. clip_image002

From systems to #stoos

Like the Snowbird retreat that spawned the agile manifesto, the recent stoos movement has emerged from a group of individuals who came together to discuss problems they perceive in existing management structures and paradigms. Now this would have been an exhilarating and inspiring event to be at – a bunch of diverse people finding emergent new understandings of organisations and how they ought to be run. Much tacit learning would have occurred.

But a problem is that one has to have been there to truly experience it. Any published output from this gathering cannot convey the vibe and learning (the tacit punch) that one would get from experiencing the event in the flesh. This is the effect of codifying knowledge into the written form. Both myself and Kailash were fully cognisant of this when we read the material on the stoos website and knew that for us, some of it would cover old ground. Nevertheless, my instinctive first reaction to what I read was “I bet someone will complain that this is just design thinking reinvented.”

Guess what… a short time later that’s exactly what happened too. Someone tweeted that very assertion! Presumably this opinion was offered by a self-identified design thinker who felt that the stoos crowd was reinventing the wheel that design thinkers had so painstakingly put together. My immediate urge was to be a smartarse and send back a tweet telling this person that design thinking is just pragmatic systems thinking anyway so he was just as guilty as the #stoos crowd. I then realised I might be found guilty of the same thing and someone might inform me of some “deeper knowing” than systems thinking. Nevertheless I couldn’t resist and made a tweet to that effect.

The decay (or remarkable recurrence) of knowledge…

(At this point I discussed this topic with Kailash and have looped him into the conversation)

Both of us see a pattern of a narrow focus or plain misinterpretation of what has come before. As a result, it seems there is a tendency to reinvent the wheel and slap a new label on claiming it to be unique or profound. We wonder therefore, how much of the ideas of new groups or movements are truly new.

Any corpus of knowledge is a bunch of memes – “ideas, behaviours or styles that spread from person to person within a culture.” Indeed, entire disciplines such as project management can be viewed as a bunch of memes that have been codified into a body of knowledge. Some memes are “sticky,” in that they are more readily retained and communicated, while others get left behind. However, stickiness is no guarantee of rightness. Two examples of such memes that we covered in our book are the waterfall methodology and the PERT scheduling technique Though both have murky origins and are of questionable utility, they are considered to be stock standard in the PM world, at least in certain circles. While it would take us too far afield to recount the story here (and we would rather you read our book Smile ) the point is that some techniques are widely taught and used despite being deeply flawed. Clearly the waterfall meme had strong evolutionary characteristics of survival while the story of its rather nuanced beginnings have been lost until recently.

A person indoctrinated in a standard business school curriculum sees real-life situations through the lens of the models (or memes!) he or she is familiar with. To paraphrase a well-known saying – if one is familiar only with a hammer, every problem appears as a nail. Sometime (not often enough!) the wielder of the metaphorical hammer eventually realises that not all problems yield to hammering. In other words, the models they used to inform their actions were incomplete, or even incorrect. They then cast about for something new and thus begin a quest for a new understanding. In the present day world one doesn’t have to search too hard because there are several convenient corpuses of knowledge to choose from. Each supply ready-made models of reality that make more sense than the last and as an added bonus, one can even get a certification to prove that one has studied it.

However, as demonstrated above with the realisation that not all problems yield to hammering, reality can truly be grasped only through experience, not models. It is experience that highlights the difference between the real-world and the simplistic one that is captured in models. Reality consists of complex, messy situations and any attempt to capture reality through concepts and models will always be incomplete. In the light of this it is easy to see why old knowledge is continually rediscovered, albeit in a different form. Since models attempt to grasp the ungraspable, they will all contain many similarities but will also have some differences. The stoos movement, design thinking and systems thinking are rooted in the same reality, so their similarities should not be surprising.

Coming back to Darryl – his view of project management with 30 years experience includes a whole bunch of memes and models, that for whatever reason, tend to be less sticky than the ones we all know so well. Why certain memes are less successful than others in being replicated from person to person is interesting in its own right and has been discussed at length in our book. For now, we’ll just say that those who come up with new labels to reflect their new understandings are paradoxically wise and narrow minded at the same time. They are wise in that they are seeking better models to understand the reality they encounter, but at the same time likely trashing some worthwhile ones too. Reality is multifaceted and cannot be captured in any particular model, so the finders of a new truth should take care that they do not get carried away by their own hyperbole.

Thanks for reading

Paul Culmsee (with Kailash Awati)

www.hereticsguidebooks.com

 Digg  Facebook  StumbleUpon  Technorati  Deli.cio.us  Slashdot  Twitter  Sphinn  Mixx  Google  DZone 

No Tags



Jan 17 2012

Why can’t people find stuff on the intranet?–Final summary

Hi

Those of you who get an RSS feed of this blog might have noticed it was busy over last week. This is because I pushed out 4 blog posts that showed my analysis using IBIS of a detailed linear discussion on LinkedIn. To save people getting lost in the analysis, I thought I’d quickly post a bit of an executive summary from the exercise.

To set context, Issue Mapping is a technique of visually capturing rationale. It is graphically represented using a simple, but powerful, visual structure called IBIS (Issue Based Information System). IBIS allows all elements and rationale of a conversation to be captured in a manner that can be easily reflected upon. Unlike prose, which is linear, the advantage of visually representing argument structure is it helps people to form a better mental model of the nature of a problem or issue. Even better, when captured this way, makes it significantly easier to identify emergent themes or key aspects to an issue.

You can find out all about IBIS and Dialogue Mapping in my new book, at the Cognexus site or the other articles on my blog.

The challenge…

On the Intranet Professionals group on LinkedIn recently, the following question was asked:

What are the main three reasons users cannot find the content they were looking for on intranet?

In all, there were more than 60 responses from various people with some really valuable input. I decided that it might be an interesting experiment to capture this discussion using the IBIS notion to see if it makes it easier for people to understand the depth of the issue/discussion and reach a synthesis of root causes.

I wrote 4 posts, each building on the last, until I had covered the full conversation. For each post, I supplied an analysis of how I created the IBIS map and then exported the maps themselves. You can follow those below:

Part 1 analysis: http://www.cleverworkarounds.com/2012/01/15/why-cant-users-find-stuff-on-the-intranet-in-ibis-synthesispart-1/
Part 2 analysis: http://www.cleverworkarounds.com/2012/01/15/why-cant-users-find-stuff-on-the-intranet-an-ibis-synthesispart-2/
Part 3 analysis: http://www.cleverworkarounds.com/2012/01/16/why-cant-users-find-stuff-on-the-intranet-an-ibis-synthesispart-3/
Part 4 analysis: http://www.cleverworkarounds.com/2012/01/16/why-cant-users-find-stuff-on-the-intranet-an-ibis-synthesispart-4/

Final map: http://www.cleverworkarounds.com/maps/findstuffpart4/Linkedin_Discussion__192168031326631637693.html

For what its worth, the summary of themes from the discussion was that there were 5 main reasons for users not finding what they are looking for on the intranet.

  1. Poor information architecture
  2. Issues with the content itself
  3. People and change aspects
  4. Inadequate governance
  5. Lack of user-centred design

Within these areas or “meta-themes” there were varied sub issues. These are captured in the table below.

Poor information architecture Issues with content People and change aspects Inadequate governance Lack of user-centred design
Vocabulary and labelling issues

· Inconsistent vocabulary and acronyms

· Not using the vocabulary of users

· Documents have no naming convention

Poor navigation

Lack of metadata

· Tagging does not come naturally to employees

Poor structure of data

· Organisation structure focus instead of user task focussed

· The intranet’s lazy over-reliance on search

Old content not deleted

Too much information of little value

Duplicate or “near duplicate” content

Information does not exist or an unrecognisable form

People with different backgrounds, language, education and bias’ all creating content

Too much “hard drive” thinking

People not knowing what they want

Lack of motivation for contributors to make information easier to use

Google inspired inflated expectations on search functionality on intranet

Adopting social media from a hype driven motivation

Lack of governance/training around metadata and tagging

Not regularly reviewing search analytics

Poor and/or low cost search engine is deployed

Search engine is not set up properly or used to full potential

Lack of “before the fact” coordination with business communications and training

Comms and intranet don’t listen and learn from all levels of the business.

Ambiguous, under-resourced or misplaced Intranet ownership

The wrong content is being managed

There are easier alternatives available

Content is structured according to the view of the owners rather than the audience

Not accounting for two types of visitors… task-driven and browse-based

No social aspects to search

Not making the search box available enough

A failure to offer an entry level view

Not accounting for people who do not know what they are looking for versus those who do

Not soliciting feedback from a user on a failed search about what was being looked for

So now you have seen the final output, be sure to visit the maps and analysis and read about the journey on how this table emerged. One thing is for sure, it sure took me a hell of a lot longer to write about it than to actually do it!

Thanks for reading

Paul Culmsee

www.sevensigma.com.au

www.hereticsguidebooks.com

 Digg  Facebook  StumbleUpon  Technorati  Deli.cio.us  Slashdot  Twitter  Sphinn  Mixx  Google  DZone 

No Tags



Jan 16 2012

Why can’t users find stuff on the intranet? An IBIS synthesis–Part 4

Hi and welcome to my final post on the linkedin discussion on why users cannot find what they are looking for on intranets. This time the emphasis is on synthesis… so let’s get the last few comments done shall we?

Michael Rosager • @ Simon. I agree.
Findability and search can never be better than the content available on the intranet.
Therefore, non-existing content should always be number 1
Some content may not be published with the terminology or language used by the users (especially on a multilingual intranet). The content may lack the appropriate meta tags. – Or maybe you need to adjust your search engine or information structure. And there can be several other causes…
But the first thing that must always be checked is whether they sought information / data is posted on the intranet or indexed by the search engine.

Rasmus Carlsen • in short:
1: Too much content (that nobody really owns)
2: Too many local editors (with less knowledge of online-stuff)
3: Too much “hard-drive-thinking” (the intranet is like a shared drive – just with a lot of colors = a place you keep things just to say that you have done your job)

Nick Morris • There are many valid points being made here and all are worth considering.
To add a slightly different one I think too often we arrange information in a way that is logical to us. In large companies this isn’t necessarily the same for every group of workers and so people create their own ‘one stop shop’ and chaos.
Tools and processes are great but somewhere I believe you need to analyse what information is needed\valued and by whom and create a flexible design to suit. That is really difficult and begins to touch on how organisations are structured and the roles and functions of employees.

Taino Cribb • Hi everyone
What a great discussion! I have to agree to any and all of the above comments. Enabling users to find info can definately be a complicated undertaking that involves many facets. To add a few more considerations to this discussion:
Preference to have higher expectations of intranet search and therefore “blame” it, whereas Google is King – I hear this too many times, when users enter a random (sometimes misspelled) keyword and don’t get the result they wish in the first 5 results, therefore the “search is crap, we should have Google”. I’ve seen users go through 5 pages of Google results, but not even scroll down the search results page on the intranet.
Known VS Learned topics – metadata and user-tagging is fantastic to organise content we and our users know about, but what about new concepts where everyone is learning for the first time? It is very difficult to be proactive and predict this content value, therefore we often have to do so afterwards, which may very well miss our ‘window of opportunity’ if the content is time-specific (ie only high value for a month or so).
Lack of co-ordination with business communications/ training etc (before the fact). Quite often business owners will manage their communications, but may not consider the search implications too. A major comms plan will only go so far if users cannot search the keywords contained in that message and get the info they need. Again, we miss our window if the high content value is valid for only a short time.
I very much believe in metadata, but it can be difficult to manage in SP2007. Its good to see the IM changes in SP2010 are much improved.

Of the next four comments most covered old ground (a sure sign the conversation is now fairly well saturated). Nick says he is making a “a slightly different” point, but I think issues of structure not suiting a particular audience has been covered previously. I thought Taino’s reply was interesting because she focused on the issue of not accounting for known vs. learned topics and the notion of a “window of opportunity” in relation to appropriate tagging. Perhaps this reply was inspired by what Nick was getting at? In any event, adding it was a line call between governance and information architecture and for now, I chose the latter (and I have a habit of changing my mind with this stuff :-) .

image_thumb[12]

I also liked Taino’s point about user expectations around the “google experience” and her examples. I also loved earlier Rasmus’s point about “hard-drive thinking” (I’m nicking that one for my own clients Rasmus Smile). Both of these issues are clearly people aspects, so I added them as examples around that particular theme.

image_thumb[14]

Finally, I added Taino’s “lack of co-ordination” comments as another example of inadequate governance.

image_thumb[18]

Anne-Marie Low • The one other thing I think missing from here (other than lack of metadata, and often the search tool itself) is too much content, particularly out of date information. I think this is key to ensuring good search results, making sure all the items are up to date and relevant.

Andrew Wright • Great discussion. My top 3 reasons why people can’t find content are:
* Lack of meta data and it’s use in enabling a range of navigation paths to content (for example, being able to locate content by popularity, ownership, audience, date, subject, etc.) See articles on faceted classification:
http://en.wikipedia.org/wiki/Faceted_classification
and
Contextual integration
http://cibasolutions.typepad.com/wic/2011/03/contextual-integration-how-it-can-transform-your-intranet.html#tp
* Too much out-of-date, irrelevant and redundant information
See slide 11 from the following presentation (based on research of over 80 intranets)
http://www.slideshare.net/roowright/intranets2011-intranet-features-that-staff-love
* Important information is buried too far down in the hierarchy
Bonus 2 reasons :-)
* Web analytics and measures not being used to continuously improve how information is structured
* Over reliance on Search instead of Browsing – see the following article for a good discussion about this
Browse Versus Search: Stumbling into the Unknown
http://idratherbewriting.com/2010/05/26/browse-versus-search-organizing-content-9/

Both Anne and Andrew make good points and Andrew supplies some excellent links too, but all of these issues have been covered in the map so nothing more has been added from this part of the discussion.

Juan Alchourron • 1) that particular, very important content, is not yet on the intranet, because “the” director don’t understand what the intranet stands for.
2) we’re asuming the user will know WHERE that particular content will be placed on the intranet : section, folder and subfolder.
3) bad search engines or not fully configured or not enough SEO applied to the intranet

John Anslow • Nowt new from me
1. Search ineffective
2. Navigation unintuitive
3. Useability issues
Too often companies organise data/sites/navigation along operational lines rather than along more practical means, team A is part of team X therefore team A should be a sub section of team X etc. this works very well for head office where people tend to have a good grip of what team reports where but for average users can cause headaches.
The obvious and mostly overlooked method of sorting out web sites is Multi Variant Testing (MVT) and with the advent of some pretty powerful tools this is no longer the headache that it once was, why not let the users decide how they want to navigate, see data, what colour works best, what text encourages them to follow what links, in fact how it works altogether?
Divorcing design, usability, navigation and layout from owners is a tough step to take, especially convincing the owners but once taken the results speak for themselves.

Most of these points are already well discussed, but I realised I had never made a reference to John’s point about organisational structures versus task based structures for intranets. I had previously captured rationale around the fact that structures were inappropriate, so I added this as another example to that argument within information architecture…

image

Edwin van de Bospoort • I think one of the main reasons for not finding the content is not poor search engines or so, but simply because there’s too much irrelevant information disclosed in the first place.
It’s not difficult to start with a smaller intranet, just focussing on filling out users needs. Which usually are: how do I do… (service-orientated), who should I ask for… (corporate facebok), and only 3rd will be ‘news’.
So intranets should be task-focussed instead if information-focussed…
My 2cnts ;)

Steven Kent • Agree with Suzanne’s suggestion “Old content is not deleted and therefore too many results/documents returned” – there can be more than one reason why this happens, but it’s a quick way to user frustration.

Maish Nichani • It is interesting to see how many of us think metadata and structure are key to finding information on the intranet. I agree too. But come to think of it, staff aren’t experts in information management. It’s all very alien to them. Not too long ago, they had their desktops and folders and they could find their information when they wanted. All this while it was about “me and my content”. Now we have this intranet and shared folders and all of a sudden they’re supposed to be thinking about how “others” would like to find and use the information. They’ve never done this before. They’ve never created or organized information for “others”. Metadata and structure are just “techie” stuff that they have to do as part of their publishing, but they don’t know why they’re doing it or for what reason. They real problem, in my opinion, is lack of empathy.

Barry Bassnett • * in establishing a corporate taxonomy.1. Lack of relevance to the user; search produces too many documents.3. Not training people in the concept that all documents are not created by the individual for the same individual but as a document that is meant to be shared. e.g. does anybody right click PDFs to add metadata to its properties? Emails with a subject line stat describe what is in it.

Luc de Ruijter • @Maish. Good point about information management.
Q: Who’d be responsible to oversee the management of information?
Shouldn’t intranet managers/governors have that responsibility?
I can go along with (lack of) empathy as an underlying reason why content isn’t put away properly. This is a media management legacy reason: In media management content producers never had to have empathy with participating users, for there were only passive audiences.
If empathy is an issue. Then it proves to me that communication strategies are still slow to pick up on the changes in communication behaviour and shift in mediapower, in the digital age.
So if we step back from technological reasons for not finding stuff (search, meta, office automation systems etc.) another big reason looks around the corner of intranet management: those responsible for intranet policies and strategy.

Most of this discussion covers stuff already represented in the map, although I can see that in this part of the conversation there is a preoccupation with content and its relevance. Maish also makes a couple of good points. First up he makes the point that staff are not experts in information management and don’t tend to think about how someone else might wish to find the information later. He also concludes by stating the real problem is a lack of empathy. I liked this and felt that this was a nice supporting argument to the whole conjecture that “people issues” is a major theme in this discussion, so I added it as a pro.

image

 

Now we have an interesting bit in the conversation (for me anyway). Terry throws a curveball question. (Side note: Curveball questions are usually asked with genuine intent, but tend to have a negative effect on live meetings. Dialogue Mapping loves curveball questions as it is often able to deflect its negative impacts).

Terry Golding • Can I play devils advocate and ask WHY you feel meta data is so vital? Dont misunderstand me I am not saying that it is not important, but I cant help feeling that just saying meta data as a reason for not finding things is rather a simplification. Let me ask it another way, what is GOOD meta data, can you give examples please ?

Luc de Ruijter • @Terry. Good questions which can have many answers (see all comments above where you’ll find several answers already). Why do library books have labels on their covers? Those labels are in fact metadata (avant la lettre) which help library people ordering their collection, and clients to find titles. How do you create tag clouds which offer a more intuitive and user centered way to navigate a website/blog? By tagging all content with (structured) meta tags.Look around a bit and you’ll see that metadata are everywhere and that they serve you in browsing and retrieving content. That’s why metadata are vital these days.I think there are no strict right and good meta structures. Structures depend on organisational contexts. Some metastructures are very complex and formal (see comments about taxonomies above), others are quite simple.Metadata can enable users to browse information blocks. By comparisson navigation schemes can only offer rigid sender driven structures to navigate to pages.

Andrew Wright • @Terry. Meta data enables content to be found in a number of different ways – not just one as is typical of paper based content (and many intranets as well unfortunately).
For instance, if you advertise a house for sale you may have meta data about the house such as location, number of rooms and price. This then allows people to locate the house using this meta data (eg. search by number of bedrooms, price range, location). Compare this with how houses are advertised in newspapers (ie. by location only) and you can see the benefits of meta data.
For a good article about the benefits of meta data, read Card Sorting Doesn’t Cut the Custard:
http://www.zefamedia.com/websites/card-sorting-doesnt-cut-the-custard/
To read a more detailed example about how meta data can be applied to intranets, read:
Contextual integration: how it can transform your intranet
http://cibasolutions.typepad.com/wic/2011/03/contextual-integration-how-it-can-transform-your-intranet.html

Terry questions the notion of metadata. I framed it as a con against the previous metadata arguments. Both Luc and Andrew answer and I think the line that most succinctly captures the essence of than answer is Andrew’s “Meta data enables content to be found in a number of different ways”. So I reframe that slightly as a pro supporting the notion that lack of metadata is one of the reasons why users can;t find stuff on the intranet.

image

Next is yours truly…

Paul Culmsee • Hi all
Terry a devils advocate flippant answer to your devils advocate question comes from Corey Doctrow with his dated, but still hilarious essay on the seven insurmountable obstacles to meta-utopia :-) Have a read and let me know what you think.
http://www.well.com/~doctorow/metacrap.htm
Further to your question (and I *think* I sense the undertone behind your question)…I think that the discussion around metadata can get a little … rational and as such, rational metadata metaphors are used when they are perhaps not necessarily appropriate. Yes metadata is all around us – humans are natural sensemakers and we love to classify things. BUT usually the person doing the information architecture has a vested interest in making the information easy for you. That vested interest drives the energy to maintain the metadata.
In user land in most organisations, there is not that vested interest unless its on a persons job description and their success is measured on it. For the rest of us, the energy required to maintain metadata tends to dissipate over time. This is essentially entropy (something I wrote about in my SharePoint Fatigue Syndrome post)
http://www.cleverworkarounds.com/2011/10/12/sharepoint-fatigue-syndrome/

Bob Meier • Paul, I think you (and that metacrap post) hit the nail on the head describing the conflict between rational, unambiguous IA vs. the personal motivations and backgrounds of the people tagging and consuming content. I suspect it’s near impossible to develop a system where anyone can consistently and uniquely tag every type of information.
For me, it’s easy to get paralyzed thinking about metadata or IA abstractly for an entire business or organization. It becomes much easier for me when I think about a very specific problem – like the library book example, medical reports, or finance documents.

Taino Cribb • @Terry, brilliant question – and one which is quite challenging to us that think ‘metadata is king’. Good on you @Paul for submitting that article – I wouldn’t dare start to argue that. Metadata certainly has its place, in the absence of content that is filed according to an agreed taxonomy, correctly titled, the most recent version (at any point in time), written for the audience/purpose, valued and ranked comparitively to all other content, old and new. In the absence of this technical writer’s utopia, the closest we can come to sorting the wheat from the chaff is classifcation. It’s not a perfect workaround by any means, though it is a workaround.
Have you considered that the inability to find useful information is a natural by-product of the times? Remember when there was a central pool to type and file everything? It was the utopia and it worked, though it had its perceived drawbacks. Fast forward, and now the role of knowledge worker is disseminated to the population – people with different backgrounds, language, education and bias’ all creating content.
It is no wonder there is content chaos – it is the price we pay for progress. The best we as information professionals can do is ride the wave and hold on the best we can!

Now my reply to Terry was essentially speaking about the previously spoken of issue around lack of motivation on the part of users to make their information easy to use. I added a pro to that existing idea to capture my point that users who are not measured on accurate metadata have little incentive to put in the extra effort. Taino then refers to pace of change more broadly with her “natural by-product of the times” comment. This made me realise my meta theme of “people aspects” was not encompassing enough. I retitled it “people and change aspects” and added two of Taino’s points as supporting arguments for it.

image

At this point I stopped as enough had been captured the the conversation had definitely reached saturation point. It was time to look at what we had…

For those interested, the final map had 139 nodes.

The second refactor

At this point is was time to sit back and look at the map with the view of seeing if my emergent themes were correct and to consolidate any conversational chaff. Almost immediately, the notion of “content” started to bubble to the surface of my thinking. I had noticed that a lot of conversation and re-iteration by various people related to the content being searched in the first place. I currently had some of that captured in Information Architecture and in light of the final map, I felt that this wasn’t correct. The evidence for this is that Information Architecture topics dominated the maps. There were 55 nodes for information architecture, compared to 34 for people and change and 31 for governance.

Accordingly, I took all of the captured rationale related to content and made it its own meta-theme as shown below…

image

Within the “Issues with the content being searched” map are the following nodes…

image

I also did another bit of fine tuning too here and there and overall, I was pretty happy with the map in its current form.

The root causes

If you have followed my synthesis of what the dialogue from the discussion told me, it boiled down to 5 key recurring themes.

  1. Poor Information Architecture
  2. Issues with the content itself
  3. People and change aspects
  4. Inadequate governance
  5. Lack of user-centred design

I took the completed maps, exported the content to word and then pared things back further. This allowed me to create the summary table below:

Poor Information Architecture Issues with content People and change aspects Inadequate governance Lack of user-centred design
Vocabulary and labelling issues

· Inconsistent vocabulary and acronyms

· Not using the vocabulary of users

· Documents have no naming convention

Poor navigation

Lack of metadata

· Tagging does not come naturally to employees

Poor structure of data

· Organisation structure focus instead of user task focussed

· The intranet’s lazy over-reliance on search

Old content not deleted

Too much information of little value

Duplicate or “near duplicate” content

Information does not exist or an unrecognisable form

People with different backgrounds, language, education and bias’ all creating content

Too much “hard drive” thinking

People not knowing what they want

Lack of motivation for contributors to make information easier to use

Google inspired inflated expectations on search functionality on intranet

Adopting social media from a hype driven motivation

Lack of governance/training around metadata and tagging

Not regularly reviewing search analytics

Poor and/or low cost search engine is deployed

Search engine is not set up properly or used to full potential

Lack of “before the fact” coordination with business communications and training

Comms and intranet don’t listen and learn from all levels of the business.

Ambiguous, under-resourced or misplaced Intranet ownership

The wrong content is being managed

There are easier alternatives available

Content is structured according to the view of the owners rather than the audience

Not accounting for two types of visitors… task-driven and browse-based

No social aspects to search

Not making the search box available enough

A failure to offer an entry level view

Not accounting for people who do not know what they are looking for versus those who do

Not soliciting feedback from a user on a failed search about what was being looked for

The final maps

The final map can be found here (for those who truly like to see full context I included an “un-chunked” map which would look terrific when printed on a large sized plotter). Below however, is a summary as best I can do in a blog post format (click to enlarge). For a decent view of proceedings, visit this site.

Poor Information Architecture

part4map1

Issues with the content itself

part4map2

People and change aspects

part4map3

Inadequate governance

part4map4

Lack of user-centred design

part4map5

Thanks for reading.. as an epilogue I will post a summary with links to all maps and discussion.

Paul Culmsee

www.sevensigma.com.au

 Digg  Facebook  StumbleUpon  Technorati  Deli.cio.us  Slashdot  Twitter  Sphinn  Mixx  Google  DZone 

No Tags



Jan 16 2012

Why can’t users find stuff on the intranet? An IBIS synthesis–Part 3

Hi all

This is the third post in a quick series that attempts to use IBIS to analyse an online discussion. The map is getting big now, but luckily, we are halfway through the discussion and will have most of the rationale captured by the end of this post. We finished the part 2 with a summary map that has grouped the identified reasons why it is hard to find information on intranets into core themes. Right now there are 4 themes that have emerged. In this post we see if there are any more to emerge and fully flesh out the existing ones. Below is our starting point for part 3.

part3map1_thumb5

Our next two responses garnered more nodes in the map than most others. I think this is a testament to the quality of their input to the discussion. First up Dan…

Dan Benatan • Having researched this issue across many diffferent company and departmental intranets, my most frequent findings are:
1. A complete lack of user-centred design. Content that many members of the organization need to access is structured according to the view of the content owners rather than the audience. This should come as no surprise, it remains the biggest challenge in public websites.
2. A failure to offer an entry level view. Much of the content held on departmental intranets is at a level of operational detail that is meaningless to those outside the team. The information required is there, but it is buried so deep in the documents that people outside the team can’t find it.
3. The intranet’s lazy over-reliance on search. Although many of us have become accustomed to using Google as our primary entry point to find content across the web, we may do this because we know we have no hope of finding the content through traditional navigation. The web is simply far too vast. We do not, however, rely purely on search once we are in the website we’ve chosen. We expect to be able to navigate to the desired content. Navigation offers context and enables us to build an understanding of the knowledge area as we approach the destination. In my research I found that most employees (>70%) try navigation first because they feel they understand the company well enough to know where to look.
4. Here I agree with many of the comments posted above. Once the user does try search, it still fails. The search engine returns too many results with no clear indication of their relative validity. There is a wealth of duplicate content on most intranets and , even worse, there is a wealth of ‘near duplicate’ content; some of which is accurate and up-to-date and much that is neither. The user has no easy way to know which content to trust. This is where good intranet management and good metadata can help.

My initial impression was that this was an excellent reply and Dan’s experience shone through it. I thought this was one of the best contributions to the discussion thus far. Let’s see what I added shall we?

First up, Dan returned to the user experience issue, which was one of the themes that had emerged. I liked his wording of the issue, so I also changed the theme node of “Inadequate user experience design” to Dan’s framing of “Lack of user-centred design”, which I thought was better put. I then added his point about content structured to the world view of owner, rather than audience. His second point about an “entry level view” relates to the first point in the sense that both are user centred design issues. So I added the entry level view point as an example…

image_thumb14

I added Dan’s point about the intranet’s lazy over-reliance on search to the information architecture theme. I did this because he was discussing the relationship between navigation and search, and navigation had already come up as an information architecture issue.

image_thumb23

Dan’s final point about too many results returned was already covered previously, but he added a lot of valuable arguments around it. I restructured that section of the map somewhat and incorporated his input.

image_thumb6

Next we have Rob, who also made a great contribution (although not as concise as Dan)

Rob Faulkner • Wow… a lot of input, and a lot of good ideas. In my experience there can be major liabilities with all of these more “global” concepts, however.
No secret… Meta data is key for both getting your site found to begin with, as well as aiding in on-site search. The weak link in this is the “people aspect” of the exercise, as has been alluded to. I’ve worked on interactive vehicles with ungodly numbers of pages and documents that rely on meta data for visibility and / or “findability” (yes, I did pay attention in English class once in a while… forgive me), and the problem — more often than not — stems from content managers either being lazy and doing a half ass job of tagging, if at all, or inconsistency of how things are tagged by those that are gung-ho about it. And, as an interactive property gets bigger, so too does the complexity tagging required to make it all work. Which circles back to freaking out providers into being lazy on the one hand, or making it difficult for anyone to get it “right” on the other. Vicious circle. Figure that one out and you win… from my perspective.
Another major issue that was also alluded to is organization. For an enterprise-class site, thorough taxonomy / IA exercises must be hammered out by site strategists and THEN tested for relevance to target audiences. And I don’t mean asking targets what THEY want… because 9 times out of 10 you’re either going to get hair-brained ideas at best, or blank stares at worst. You’ve just got to look at the competitive landscape to figure out where the bar has been set, what your targets are doing with your product (practical application, OEMing, vertical-specific use, etc… then Test the result of your “informed” taxonomy and IA to ensure that it does, in fact, resonate with your targets once you’ve gotten a handle on it.
Stemming from the above, and again alluded to, be cautious about how content is organized in order to reflect how your targets see it, not how internal departments handle it. Most of the time they are not one in the same. Further, you’ve got to assume that you’re going to have at least two types of visitors… task-driven and browse-based. Strict organization by product or service type may be in order for someone that knows what they’re looking for, but may not mean squat to those that don’t. Hence, a second axis of navigation that organizes your solutions / products by industry, pain point, what keeps you up at night, or whatever… will enable those that are browsing, or researching, a back door into the same ultimate content. Having a slick dynamic back-end sure helps pull this off
Finally, I think a big mistake made across all verticals is what the content consists of to begin with. What you may think is the holy grail, and the most important data or interactive gadget in the world may not mean a hill-of-beans to the user. I’ve conducted enough focus groups, worldwide, to know that this is all typically out of alignment. I never cease to be amazed at exactly what it is that most influences decision makers.
I know a lot of this was touched upon by many of you. Sorry about that… damn thread is just getting too long to go back and figure out exactly who said what!
Cheers…

Now Rob was the first to explicitly mention “People aspects”, and I immediately realised this was the real theme that “Lack of motivation on the part of contributors…”was getting at. So I restructured the map so that “people aspects” was the key theme and the previous point of “Lack of motivation” was an example. I then added Rob’s other examples.

image_thumb27

After making his points around people aspects, Rob then covers some areas well covered already (metadata, content organsiation), so I did not add any more nodes. But at the end, he added a point about browse oriented vs. search oriented users, which I did add to the user-centred design discussion.

image_thumb33

Rob also made a point about users who know what they want when searching for information vs. those who do not. (In Information Architecture terms, this is called “Known item seeking” vs “exploratory seeking”). That had not been covered previously, so I added it to the Information Architecture discussion.

image_thumb31

Finally, I captured Rob’s point about the wrong content being managed in the first place. This is a governance issue since the best Information architecture or user experience design won;t matter a hoot if you’re not making the right content available in the first place.

image_thumb32

Hans Leijström • Great posts! I would also like to add lack of quality measurements (e.g. number of likes, useful or not) and the fact that the intranets of today are not social at all…

Caleb Lamz • I think everyone has provided some great reasons for users not being able to find what they are looking for. I lean toward one of the reasons Bob mentions above – many intranets are simply not actively managed, or the department managing it is not equipped to do so.
Every intranet needs a true owner (no matter where it falls in the org chart) that acts a champion of the user. Call it the intranet manager, information architect, collaboration manager, or whatever you want, but their main job needs to be make life easier for users. Responsibilities include doing many of the things mentioned above like refining search, tweaking navigation, setting up a metadata structure, developing social tools (with a purpose), conducting usability tests, etc.
Unfortunately, with the proliferation of platforms like SharePoint, many IT departments roll out a system with no true ownership, so you end up with content chaos.

There is no need to add anything from Hans as he was re-iterating a previous comment about analytics which was captured already. Caleb makes a good point about ownership of content/intranet which is a governance issue in my book. So I added his contribution there…

image

Dena Gazin • @Suzanne. Yes, yes, yes – a big problem is old content. Spinning up new sites (SharePoint) and not using, or migrating sites and not deleting old or duplicative content. Huge problem! I’m surprised more people didn’t mention this. Here’s my three:
1. Metadata woes (@Erin – lack of robust metadata does sound better as improvements can be remedied on multiple levels)
2. Old or duplicate content (Data or SharePoint Governance)
3. Poorly configured search engine
Bonus reason: Overly complicated UIs. There’s a reason people like Google. Why do folks keep trying to mess up a good thing? Keep it as simple as you possibly can. Create views for those who need more. 80/20 rule!

Dena’s points are a reiteration of previous points, but I did like her “there is a reason people like google” point, which I considered to be a nice supporting argument of the entire user-centric design theme.

image

Next up we have another group of discussions. What is interesting here is that there is some disagreement – and a lot of prose – but not a lot of information was added to the map from it.

Luc de Ruijter • @Rob. Getting information and metastructures in place requires interactions with the owners of information. I doubt whether they are lazy or blank staring people – I have different experiences with engaging users in preparing digital working environments. People may stare back at you when you offer complete solutions they can say “yea” or “nay” to. And this is still common practice amogst Communication specialists (who like creating stuff themselves first and then communicate about them to others later). And if colleagues stare blank at your proposal, they obviously are resisting change and in need of some compelling communciation campaign…
Communication media legacy models are a root cause for failing intranets.
Tagging is indeed a complex excercise. And we come from a media-age in which fun predominated and we were all journalists and happy bunnies writing post after post, page after page, untill the whole cluttered intranet environment was ready again for a redesign.
Enterprise content is not media content, but enterprise content. Think about it (again please :-) ). If you integrate the storage process of enterprise content into the “saving as” routine, you’ll have no problems anymore with keeping your content clean and consistent. All wil be channeled through consistent routines. This doesn’t kill adding free personal meta though, it just puts the content in a enterprise structure. Think enterprise instead of media and tagging solutions are for grabs.
I agree that working on taxonomies can become a drag. Leadership and vison can speed up the process. And mandate of course.
I believe that the whole target rationale behind websites is part of the Communication media legacy we need to loose in order to move forward in better communication services to eployees. Target-thinking hampers the construction of effectve user centered websites, for it focusses on targets, persona, audiences, scenario’s and the whole extra paper media works.
While users only need flexibility, content in context, filters and sorting options. Filtering and sorting are much more effective than adding one navigation tree ater another. And they require a 180° turn in conventional communciation thinking.
@Caleb. Who manages an intranet?
Is that a dedicated team of intranet managers, previously known as content managers, previously known as communciation advisors, previously known as mere journalists? Or is intranet a community affair in which the community IS the manager of content? Surely you want a metamodel to be managed by a specialist. And make general management as much a non-silo activity as possible. Collaboration isn’t confined to silo’s so intranet shouldn’t be either.
A lot of intranets are run by a small group of ‘experts’ whose basic reasoning is that intranet is a medium like an employee magazine. If you want content management issues, start making such groups responsible for intranet.
In my experience intranets work once you integrate them in primary processes. Itranet works for you if you make intranet part of your work. De-medialise the intranet and you have more chance for sustainable success.
Rolling out Sharepoints is a bit like rolling back time. We’ll end up somewhere where we already were in 2001, when digital IC was IT policy. The fact that we are turning back to that situation is a good and worrying illustration of the fact that strategy on digital communications is lacking in the Communications department – otherwise they wouldn’t loose out to IT.
@Dena. I think your bonus reason is a specific Sharepoint reason. Buy Sharepoint and get a big bonus bag of bad design stuff with it – for free! An offer you can’t refuse :-)

Luc de Ruijter • @Dena. My last weeks tweet about search: Finding #intranet content doesn’t start with #search #SEO. It starts with putting information in a content #structure which is #searchable. Instead of configuring your search engine, think about configuring your content first.

Once again Luc is playing the devils advocate role with some broader musings. I might have been able to add some of this to the map, but it was mostly going over old ground or musings that were not directly related to the question being asked. This time around, Rob takes issue with some of his points and Caleb agrees…

Rob Faulkner • @Luc, Thanks for your thoughtful response, but I have to respectfully disagree with you on a few points. While my delivery may have been a bit casual, the substance of my post is based on experience.
First of all, my characterizations of users being 1) lazy or 2) blank staring we’re not related to the same topic. Lazy: in reference to tagging content. Blank Staring: related to looking to end users for organizational direction.
Lazy, while not the most diplomatic means of description, I maintain, does occur. I’ve experienced it, first hand. A client I’m thinking of is a major technology, Fortune 100 player with well over 100K tech-focused, internet savvy (for the most part) employees. And while they are great people and dedicated to their respective vocation, they don’t always tag documents and / or content-chunks correctly. It happens. And, it IS why a lot of content isn’t being located by targets — internally or externally. This is especially the case when knowledge or content management grows in complexity as result of content being repurposed for delivery via different vehicles. It’s not as simple as a “save as” fix. This is why I find many large sites that provide for search via pre-packed variables, — i.e. drop-downs, check-boxes, radio-buttons, etc — somewhat suspect, because if you elect to also engage in keyword index search you will, many times, come up with a different set of results. In other words, garbage in, garbage out. That being said, you asked “why,” not “what to do about it” and they are two completely different topics. I maintain that this IS definitely a potential “why.”
As far as my “blank stare” remark, it had nothing to do with the above, which you tied it to… but I am more than fluent in engaging and empowering content owners in the how’s and why’s of content tagging without confusing them or eliciting blank stares. While the client mentioned above is bleeding-edge, I also have vast experience with less tech-sophisticated entities — i.e. 13th-century country house hotels — and, hence, understand the need to communicate with contributors appropriate to what will resonate with them. This is Marketing 101.
In regard to the real aim of my “blank stare” comment, it is very germane to the content organization conversation in that it WILL be one of your results if you endeavour to ask end-users for direction. It is, after all, what we as experts should be bringing to the table… albeit being qualified by user sounding boards.
Regarding my thoughts on taxonomy exercises… I don’t believe I suggested it was a drag, at all. The fact is, I find this component of interactive strategy very engaging… and a means to create a defensible, differentiated marketing advantage if handled with any degree of innovation.
In any event, I could go on and on about this post and some of the assumptions, or misinterpretations, you’ve made, but why bother? When I saw your post initially, it occurred to me you were looking for input and perhaps insight into what could be causing a problem you’re encountering… hence the “why does this happen” tone. Upon reviewing the thread again, it appears you’re far more interested in establishing a platform to pontificate. If you want to open a discussion forum you may want to couch your topic in more of a “what are your thoughts about x, y, z?”… rather than “what could be causing x, y, z?” As professionals, if we know the causes we’re on track to address the problem.

Caleb Lamz • I agree with Rob, that this thread has gone from “looking for input” to “a platform to pontificate”. You’re better off making this a blog post rather than asking for input and then making long and sometimes off the cuff remarks on what everyone else has graciously shared. It’s unproductive to everyone when you jump to conclusions based on the little information that other users can provide in a forum post.

Luc de Ruijter • The list:
Adopting social media from a hype-driven motivation (lack of coherence)
big problem with people just PDFing EVERYTHING instead of posting HTML pages
Comms teams don’t listen and learn from all levels of the business
Content is not where someone thought it would be or should be or its not called what they thought it was called or should be called.
content is titled poorly
content managers either being lazy and doing a half ass job of tagging
content they are trying to find is out of date, cannot be trusted or isn’t even available on the intranet.
Documents have no naming convention
failure to offer an entry level view
inconsistency of how things are tagged
Inconsistent vocabulary and acronyms
info is organised by departmental function rather than focussed on end to end business process.
information being searched does not actually exist or exists only in an unrecognisable form and therefore cannot be found!
intranet’s lazy over-reliance on search
intranets are simply not actively managed, or the department managing it is not equipped to do so.
intranets of today are not social at all
just too much stuff
Lack of content governance, meta-data and inconsistent taxonomy, resulting in poor search capability.
Lack of measuring and feedback on (quality, performance of) the intranet
Lack of metadata
lack of motivation on the part of contributors to make their information easy to use
lack of quality measurements (e.g. number of likes, useful or not
lack of robust metadata
lack of robust metadata, resulting in poor search results;
lack of user-centred design
main navigation is poor
not fitting the fact that there are at least two types of visitors… task-driven and browse-based
Not making the search box available enough
Old content is not deleted and therefore too many results/documents returned
Old or duplicate content (Data or SharePoint Governance)
Overly complicated UIs
Poor navigation, information architecture and content sign-posting
Poorly configured search engine
proliferation of platforms like SharePoint
relevance of content (what’s hot for one is not for another)
Search can’t find it due to poor meta data
Search engine is not set up correctly
search engine returns too many results with no clear indication of their relative validity
structure is not tailored to the way the user thinks

Luc de Ruijter • This discussion has produced a qualitative and limited list of root causes for not finding stuff. I think we can all work with this.
@Rob & @Caleb My following question is always what to do after digesting and analysing information. I’m after solutions, that;s why I asked about root causes (and not symptoms). Reading all the comments triggers me in sharing some points of view. Sometimes that’s good to fuel the conversation sometimes. For if there is only agreement, there is no problem. And if there is no problem, what will we do in our jobs? If I came across clerical, blame it on Xmas.
Asking the “what to do with this input?” is pehaps a question for another time.

The only thing I added to the map from this entire exchange is Rob’s point of no social aspects to search. I thought this was interesting because of an earlier assertion that applying social principles to an intranet caused more silos. Seems Luc and Rob have differing opinions on this point.

image

Where are we at now?

At this point, we are almost at the end of the discussion. In this post, I added 25 nodes against 10 comments. Nevertheless, we are not done yet. In part 4 I will conclude the synthesis of the discussion and produce a final map. I’ll also export the map to MSWord, summarising the discussion as it happened. Like the last three posts, you can click here to see the maps exported in more detail.

There are four major themes that have emerged. Information Architecture, People aspects, Inadequate governance and lack of user-centred design. The summary maps for each of these areas are below (click to enlarge):

Information Architecture

part3map2[5]

People aspects

part3map3[5]

Inadequate Governance

part3map4[5]

Lack of user-centred design

part3map5[5]

Thanks for sticking with me thus far – almost done now…

Paul Culmsee

CoverProof29

www.sevensigma.com.au

 Digg  Facebook  StumbleUpon  Technorati  Deli.cio.us  Slashdot  Twitter  Sphinn  Mixx  Google  DZone 

No Tags



Jan 15 2012

Why can’t users find stuff on the intranet? An IBIS synthesis–Part 2

Hi all

This is the second post in a quick series that attempts to use IBIS to analyse an online discussion. Strange as it may sound, but I believe that issue mapping and IBIS is one of the most pure forms of information architecture you can do. This is because a mapper, you are creating a navigable mental model of speech as it is uttered live. This post is semi representative of this. I am creating an IBIS based issue map, but I’m not interacting live with participants. nevertheless, imagine if you will, you sitting in a room with a group of stakeholders answering the question on why users cannot find what they are looking for on the intranet. Can you see its utility in creating shared understanding of a multifaceted issue?

Where we left off…

We finished the previous discussion with a summary map that identified several reasons why it is hard to find information on intranets. In this post we will continue our examination of this topic. What you will notice in this post is that the number of nodes that I capture are significantly less than in part 1. This is because some topics start to become saturated and people’s contributions are the same as what is already captured. In Part 1, I captured 55 nodes from the first 11 replies to the question. In this post, I capture an additional 33 nodes from the next 15 replies.

image_thumb72

So without further adieu, lets get into it!

Suzanne Thornley • Just another few to add (sorry 5 not 3 :-) :
1. Search engine is not set up correctly or used to full potential
2. Old content is not deleted and therefore too many results/documents returned
3. Documents have no naming convention and therefore it is impossible to clearly identify what they are and if they are current.
4. Not just a lack of metadata but also a lack of governance/training around metadata/meta tagging so that less relevant content may surface because the tagging and metadata is better.
5. Poor and/or low cost search engine is deployed in the mistaken belief that users will be happy/capable of finding content by navigating through a complex intranet structure.

Suzanne offered 5 additional ideas to the original map from where we last left off. She was also straight to the point too, which always makes a mappers job of expressing it in IBIS easier. You might notice that I reversed “Old content is not deleted and therefore too many results/documents returned” in the resulting map. This is because I felt that old content not being deleted was one of a number arguments supporting why too many results are returned.

image_thumb[26]

My first map refactor

With the addition of Suzanne’s contributions, I felt that it was a good time to take stock and adjust the map. First up, I felt that a lot of topics were starting to revolve around the notion of information architecture, governance and user experience design. So I grouped the themes of vocabulary, lack of metadata, excessive results and issues around structure of data as part of a meta theme of “information architecture”. I similarly grouped a bunch of answers into “governance” and “user experience design”. These for me, seemed to be the three meta-themes that were emerging so far…

For the trainspotters, Suzanne’s comment about document naming conventions was added to the “Vocabulary and labelling issues” sub-map. You can’t see it here because I collapsed the detailed so you can see the full picture of themes as they are at this point.

part2map1

Patrik Bergman • Several of you mention the importance of adding good metadata. Since this doesn’t come natural to all employees, and the wording they use can differ – how do you establish a baseline for all regarding how to use metadata consistently? I have seen this in a KM product from Layer 2 for example, but it can of course be managed without this too, but maybe to a higher cost, or?

Patrick’s comment was a little hard to map. I captured his point that metadata does not come natural to employees as a pro, supporting the idea that lack of metadata is an example of poor information architecture. The other points I opted to leave off, because they were not really related to the core question on why people can’t find stuff on the intranet.

image_thumb[7]

Luc de Ruijter • @Patrik. Metadata are crucial. I’ve been using them since 2005 (Tridion at that time).You can build a lot of functionality with it. And it requires standardisation. If everyone adds his own meta, this will not enable you to create solutions. You can standardize anything in any CMS. So use your CMS to include metadata. If you have a DMS the same applies. (DMS are a more logical tool for intranets, as most enterprise content exists as documenst. Software such as LiveLink can facilitate adding meta in the save as process. You just have to tick some fields before you can save a document on to the intranet.)
@Suzanne. There’s been a lot of buzz about governance. You don’t need governance over meta, you just need a sound metastructure (and a dept of function to manage it – such as library of information management). Basically a lot of ‘governance’ can be automated instead of being discussed all the time :-) .

Like Patricks comment, much of what Luc said here wasn’t really related to the question at hand or has been captured already. But I did acknowledge his contribution to the governance debate, and he specifically argued against Suzanne’s point about lack of governance around metadata tagging.

image_thumb[11]

Next we have a series of answers, but you will notice that most of the points are re-iterating points that have already been made.

Patrik Bergman • Thanks Luc. It seems SharePoint gives us some basic metadata handling, but perhaps we need something strong in addition to SharePoint later.

Simon Evans • My top three?
1) The information being searched does not actually exist or exists only in an unrecognisable form and therefore cannot be found!
2) As Karen says above, info is organised by departmental function rather than focussed on end to end business process.
3) Lack of metadata as above

Mahmood Ahmad • @Simon evan. I want to also add Poor Information Structure in the list. Therefore Information Management should be an important factor.

Luc de Ruijter • @Patrik. Sharepoint 2010 is the first version that does something with it. Ms is a bit slow in pushing the possibilities with it.
@Simon @Mahmood Let’s say that information structure is the foundation for an intranet (or any website), and that a lack of metadata is only a symptom of a bad foundation?

Patrik Bergman • Good thing we use the 2010 version then :D I will see how good it handles it, and see if we need additional software.

Erin Dammen • I believe 1) lack of robust metadata, resulting in poor search results; 2) structure is not tailored to the way the user thinks; 3) lack of motivation on the part of contributors to make their information easy to use (we have a big problem with people just PDFing EVERYTHING instead of posting HTML pages.) I like that in SP 2010, users have the power to add their own keywords and flag pages as "I like it." Let your community do some of the legwork, I think it helps!

Simon’s first point that the information searched may not exist or may not be in the right format was new, so that was captured under governance. (After all, its hard to architect information when its not there!).

image_thumb[16]

I also added Erin’s third point about lack of motivation on the part of contributors. I mulled over this and decided it was a new theme, so I added it to the root question, rather than trying to make it fit into information architecture, governance or user experience design. I also captured her point on letting the community do the legwork through user tagging (known as folksonomy).

image_thumb[19]

Luc de Ruijter • @all. The list of root causes remains small. This is not surprising (it would be really worrying if the list of causes would be a long list). And it is good to learn that we encounter the same (few but not so easy to solve) issues.
Still, in our line of work these root causes lack overall attention. What could be the reason for that? :-)
@Erin Motivation is not the issue, I think; and facilitation is. If it is easier to PDF everything, than everyone will do so. And apparently everyone has the tools to do so. (If you don’t want people to PDF stuff, don’t offer them the quick fix.)
If another method of sharing documents is easier, then people will migrate. How easy is it to find PDF’s through search? How easy is it to add metadata to PDF’s? And are colleagues explained why consistent(!) meta is so relevant? Can employees add their own meta keywords? How do you maintain the quality and integrity of your keywords?
Of course it depends on your professional usergroup whether professionals will use "I like" buttons. Its a bit on the Facebook consumer edge if you’d ask me. Very en vogue perhaps, but in my view not so business ‘like’.

Luc, who is playing the devils advocate role as this discussion progresses, provides three counter arguments to Erin’s argument around user motivation. They are all captured as con’s.

image_thumb[21]

Steven Osborne • 1) Its not there and never was
2) Its there but inactive so can no longer be accessed
3) Its not where someone thought it would be or should be or its not called what they thought it was called or should be called.

Marcus Hamilton-Mills • 1) The main navigation is poor
2) The content is titled poorly (e.g internal branding, uncommon wording, not easy to differentiate from other content etc.)
3) Search can’t find it due to poor meta data

patrick c walsh • 1) Navigation breaks down because there’s too much stuff
2) There’s too much crap content hidden away because there’s just too much stuff
and
3) er…there’s just too much stuff

Mark Smith • 1. Poor navigation, information architecture and content sign-posting
2. Lack of content governance, meta-data and inconsistent taxonomy, resulting in poor search capability.
3. The content they are trying to find is out of date, cannot be trusted or isn’t even available on the intranet

Luc de Ruijter • @Steven Had a bit of a laugh there
@all Am I right in making the connection between
- the huge amount of content is an issue
- that internal branding causes confusion (in labeling and titles).
and
the fact that – in most cases – these causes can be back tracked to the owners of intranet, the comms department? They produce most content clutter.
Or am I too quick in drawing that conclusion?

Now the conversation is really starting to saturate. Most of the contributions above are captured already in the map as it is, so I only added two nodes: Patrick’s point about navigation (an information architecture issue) and too much information.

image_thumb[25]

Where are we at now?

We will end part 2 with a summary below. Like the first post, you can click here to see the maps exported in more detail. In part 3, the conversation got richer again, so the maps will change once again.

Until then, thanks for reading

Paul Culmsee

CoverProof29

www.sevensigma.com.au

part2map2

 Digg  Facebook  StumbleUpon  Technorati  Deli.cio.us  Slashdot  Twitter  Sphinn  Mixx  Google  DZone 

No Tags



« Previous PageNext Page »

Today is: Wednesday 19 June 2013 |