When I read several months back, that Microsoft was going to supply an API for external storage in MOSS/WSS, I sprang from my desk and danced around the room and babbled incoherently as if I’d been touched by Benny Hinn.
Okay, well maybe I didn’t quite do that. But what I did do was forward the KB article to a colleague whose company is the leading reseller for EMC Documentum in my town. We’d previously had one of those conversations over a few beers where we questioned the wisdom of SharePoint’s potentially unwise, yet unsurprising use of SQL Server as the storage engine.
…so what’s wrong with SQL?
I am going to briefly dump on SQL here for a minute. But first, let me tell you, I actually like SQL Server! Always have. I hated other Office Server products like Exchange until the 2003 version and SharePoint until the 2007 version. But on the whole, I found SQL to be pretty good. So hopefully that will stop the SQL fanboys from flaming me!
Those readers who appreciate capacity planning issues would appreciate the challenges SQL based storage brings to the table.Additionally, those who have used Enterprise Information Management products like Documentum or Hummingbird (now OpenText) would nod as if Microsoft have finally realised the error of their ways with this updated API.
All of the SharePoint goodies like version control, full text indexing and records management come at a price. Disk consumption and performance drain. Microsoft say to plan for 1.5 times your previous growth in disk usage. In my own real-world results it is more like 2.5 times previous growth. Disk I/O is also increased markedly.
“So what? Disk is cheap”, you reply. Perhaps so, but the disk itself was never the major cost. Given that this is a SQL database we are talking about, a backup of a 100 gigabyte SQL database could take hours and a restore possibly longer. A differential backup of a SQL database would be the entire 100 gig as it is generally one giant database file! So the whole idea of differential backups during the week and full backups on weekends suddenly has to be re-examined. Imagine a disk partition gets corrupted rendering the data useless. In a file server, this might mean 20% of shared files are unavailable while a restore takes place. In a SQL world, you have likely toasted the whole thing and a full restore is required. Often organisations overlook the common issue of existing backup infrastructure not being scalable enough to deal with SQL databases of this size.
“100 gig”, you scoff, don’t be ridiculous”. Sorry but I have news for you. At an application level, there is a scalability issue in that the lowest logical SharePoint object that can have its own database is a Site Collection, not an individual site. For many reasons, it is usually better to use a single site collection where possible. But if one particular SharePoint site has a library with a lot of files, then the entire content database for the site collection is penalised.
Now the above reasons may be the big ticket items that vendors use to sell SAN’s and complex Backup/Storage solutions, but that’s not the real issue.
The real issue (drumroll…)
It may come as a complete shock to you, but documents are not all created equal. No! Really? 🙂 If they were, those crazy cardigan wearing, file-plan obsessed document controllers and records managers wouldn’t exist. But as it happens, different content is handled and treated completely differently, based on its characteristics.
Case in point: Kentucky Fried Chicken would have some interesting governance around the recipe for its 11 herbs and spices, as would the object of Steve Balmer’s chair throwing with their search engine page ranking algorithms.
I picked out those two obvious examples to show an extreme in documents with high intrinsic value to an organisation. The reality is much more mundane. For example, you may be required by law to store all financial records for seven years. In this day and age, invoices can be received electronically, via fax or email. Once processed by accounts payable, invoices largely have little real value.
By using SQL Server, Microsoft is in effect allocating an identical cost of each document in terms of infrastructure cost. Since all documents of a site collection reside inside a SQL content database, you have limited flexibility to shift lower value documents to lower cost storage infrastructure.
How do the big boys do it then?
Documentum as an example stores the content itself in traditional file shares and then stores the name and location of that document (and any additional metadata) in the SQL database. Those of you who have only seen SharePoint may think this is a crazy idea and introduce much more complex disaster recovery issues. But the reality is the opposite.
Consider the sorts of things you can do with this set-up. You can have many file shares on many servers or SAN’s. Documentum for example, would happily allow an administrator to automatically move all documents not accessed in three months to older, slower file storage. It would move the files and then update the file location in SQL so the new location is hidden from the user and they don’t even know it has been moved. Conversely, documents on older, slower storage that have been accessed recently can be moved back to the faster storage automatically.
It also facilitates geographically dispersed organisations having a central SQL repository for the document metadata, but each remote site has a local file store, to make retrieval work at LAN speeds for most documents. This is a much simpler geographically dispersed scenario than SharePoint can ever do right now.
Restores from backup are quite simple. If a file server corrupts, it only affects the documents stored on that file server. Individual file restores are easy to perform and you don’t have to do a major 100gig database restore for a few files.
Furthermore, documents that have a compliance requirement, but do not need to be immediately available, can easily be archived off to read-only media, thus reducing disk space consumption. The metadata detail of the file can still be retrieved from the SQL database, but location information in the SQL database can now refer to a DVD or tape number.
For this reason, it is clear that SharePoint’s architecture has some cost and scalability limitations when it comes to disk usage and management, largely due to SQL Databases and the limitation of Site Collections for content databases.
So how can we move less valuable documents onto less expensive disk hardware? Multiple databases? Possibly, but that requires multiple site collections and this complicates your architecture significantly. (Doing that is the Active Directory equivalent of using separate forests for each of your departments).
Note to SharePoint fanboys: I am well aware that you can ‘sort of’ do some of this stuff via farm design, site design and 3rd party tools. But until you have seen an high end enterprise content management system, there is no contest.
So you might wonder why SharePoint is all the rage then – even for organisations that already have high end ECM systems? Well the short answer is other ECM vendors’ GUIs suck balls and users like SharePoint’s front end better. (And I am not going to provide a long answer 🙂 )
As I said at the start of this post, I was very happy to hear about Microsoft’s external storage API. In my mind’s eye, I envisaged a system where you create two types of document libraries: ‘standard’ document libraries that use SQL as the store and ‘enhanced’ document libraries that look and feel identical to a regular document library, but it stores the data outside of SQL. Each ‘enhanced document library’ would be able to point to various different file stores, configured from within the properties of the document library itself.
Utopia my butt!
Then a few weeks back some more detail emerged in SDK documentation and my dream was shattered. This really smells like a “just get version 1 out there and we will fix it properly in version 2” release. I know all software companies partake in this sales technique, but it is Microsoft we are talking about here. Therefore it it my god given right…no…my god given privilege to whine about it as much as I see fit.
Essentially this new feature defines an external BLOB store (EBS). The EBS runs parallel to the SQL Server content database, which stores the site’s structured data. You will note that this is pretty much the Documentum method.
In SharePoint, you must implement a COM interface (called the EBS Provider) to keep these two stores in sync. The COM interface recognizes file Save and Open commands and invokes redirection calls to the EBS. The EBS Provider also ensures that the SQL Server content database contains metadata references to their associated BLOB streams in the external BLOB store.
You install and configure the EBS Provider on each Web front end server in your farm. In its current version, external BLOB storage is supported only at the scope of the farm (SPFarm).
Your point being?
If you haven’t realised why I marked the previous sentence in bold, it is this. Since this EBS provider can only be supported at farm scope, then every document library on every site on every site collection in your farm now saves its data via the EBS provider.
So there is utterly nil granularity with this approach. It’s an all or nothing deal. (There goes my utopian dream of two different types of document libraries). All of your documents in the farm are doing to be stored in this EBS provider!
But it gets worse!
The external BLOB storage feature in the present release will not remain syntactically consistent with external BLOB storage technology to be released with the next full-version release of Microsoft Office and Windows SharePoint Services. Such compatibility was not a design goal, so you cannot assume that your implementation using the present version will be compatible with future versions of Microsoft Office or Windows SharePoint Services.
So basically, if you invest time and resources into implementing an EBS provider, then you’re pretty much have to rewrite it all for the next version. (At least you find this out up front).
No utility is available for moving BLOB data from the content database into the external BLOB store. Therefore, when you install and enable the EBS Provider for the first time, you must manually move existing BLOBs that are currently stored in the content database to your external BLOB store.
Okay that makes sense. It is annoying but I can forgive that. Basically if you implement an EBS provider and enable it, your choices for migrating your existing content into it is a backup and restore/overwrite operation or simply wait it out and allow the natural process of file updates do the job for you.
When using an external BLOB store with the EBS Provider, you must re-engineer your backup and restore procedures, as well as your provisions for disaster recovery, because some backup and restore functions in Windows SharePoint Services operate on the content database but not on the external BLOB store. You must handle the external BLOB store separately.
I would have preferred Microsoft to flesh this statement out, as this will potentially cause much grief if people are not aware of this. It implies that STSADM isn’t going to give you the sort of full-fidelity backup that you expect. Yeouch! I feel I might get a few late night call-outs on that one!
Ah, but wait a minute there, sunshine, is that any different to now? STSADM backup and restore is not exactly rock solid now!
Any error conditions, resource drag, or system latency that is introduced by using the EBS Provider, or in the external BLOB store itself, are reflected in the performance of the SharePoint site generally.
Yeah whatever, this is code word for Microsoft’s tech support way of getting out of helping you. “I’m sorry sir, but call your EBS vendor. Thank you come again!”
I can’t say I am surprised at this version 1 implementation, but I am disappointed. If only the granularity extended to a site collection or better still an individual site, I could forgo the requirement to extend it to individual document libraries or content types.
So it will be interesting to see if this API gets any real uptake and if it does, who would actually use it!