Johan Åhlén

Johan Åhlén's blog about life, universe and everything.

September 2009 - Posts

A first look at SQL Server 2008 R2 - StreamInsight

Background

A couple of weeks ago, the second CTP for SQL Server 2008 R2 was released. I downloaded it immediately. The product is still far from finished, and the CTP didn't contain any preview of the "Gemini" component, but I could at least test the StreamInsight part. According to Microsoft's web site, the finished product is scheduled for release during the first half of 2010. [Notice: This article has been updated for the November CTP of StreamInsight]

What's new in SQL Server 2008 R2? I would say the most important additions are:

  • StreamInsight
  • Managed Self Service Business Intelligence (the so called "Gemini" project)
  • Application and Multi-Server Management

This blog post will focus on StreamInsight.

StreamInsight

What is StreamInsight and what is it good for? StreamInsight is a platform for developing and deploying applications that handle high-speed streaming data. It can be used for near real-time processing of data from production environments, of structured data such as financial information, and of unstructured data such as Facebook, Twitter and blogs. Multiple sources can be combined and refined before the results are output. This technology is called CEP - complex event processing.

Combine StreamInsight with data mining, and you can have real-time fraud detection, stock market prediction, you name it... Imagine a real-time Business Intelligence application where management can actually see the KPI gauges moving on the screen. I would say that monitoring real-time flows of information is one of the key success factors for tomorrow's Business Intelligence. Ralph Kimball has made a similar point, saying in an interview that Business Intelligence is moving from the strategic level towards operational-level processes such as customer support, logistics and sales. At the operational level, data must be updated continuously, not just once per day. I would also add that the new generation, which has grown up with Facebook and Twitter, will make it necessary to monitor new kinds of sources for successful Business Intelligence.

How does it work?

You develop your application in Visual Studio 2008, using the StreamInsight libraries. Basically you develop (or reuse existing) input and output adapters and connect them through LINQ queries. The picture below (copied from Microsoft's StreamInsight documentation) gives a good overview of the product:


There are three different development models supported by StreamInsight:

  • Explicit model, which provides full control and access to the CEP-server.
  • Implicit model, which can handle only a single query and runs everything in memory. Good for debugging.
  • IObservable/IObserver interface, which hides much of the complexity of the CEP-server API.

All data in StreamInsight is organized into event streams. The events are of two kinds:

  • INSERT events, which contain a header with timestamp information and a payload with application-specific data. These events can follow different models, such as Interval, Point and Edge, to support different temporal characteristics.
  • CTI (current time increment) events, which indicate the completeness of the previous events in the stream.

Adapters can be of two types:

  • Untyped adapters, which work with multiple types of INSERT events
  • Typed adapters, which work with a single type of INSERT event

Queries are written in LINQ, with the possibility to do calculations, filtering, grouping, windowing, aggregation and combining, and to invoke user-defined functions. A query template is bound to specific input and output adapters to produce a query instance.
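
As an illustration, here is a minimal sketch of a windowed aggregation over a point stream of StockPayload events (the one built later in this article). The stream name stockStream is illustrative, and the code assumes the group & apply pattern and windowing operators of the November CTP:

// Sketch: average stock value per one-minute tumbling window, per stock
var averages = from e in stockStream
               group e by e.StockID into g
               from win in g.TumblingWindow(TimeSpan.FromMinutes(1),
                  HoppingWindowOutputPolicy.ClipToWindowEnd)
               select new { StockID = g.Key, Average = win.Avg(e => e.Value) };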

An example - building an input adapter for Yahoo Finance stock quotes

For testing StreamInsight, I chose to build an input adapter that reads stock quotes from Yahoo Finance. Note however that quotes from Yahoo Finance are delayed at least 15 minutes unless you pay for their premium service. To test the input adapter I also built an output adapter that displays the quotes on the screen (just a simple console application). I joined these together in an application built using the implicit development model.

The point here is not the functionality of the application, but that the output adapter could easily be replaced with something much more useful. By adding SQL Server 2008 Data Mining, using any algorithm for predictive analysis, you could do real predictive analytics on the stream. Adding an output adapter that places buy and sell orders into a trading system could make you rich. Maybe I will write about that in a later blog post, maybe not...

The picture below shows the Visual Studio 2008 project.


ScreenScraper is a helper class that reads the HTML contents of a web page and extracts a string using pattern matching (a regular expression). YahooFinance is a helper class that provides the ScreenScraper with a suitable URL and pattern-matching expression for the desired stock symbol. StockPayload contains the actual data and is defined as below:

public class StockPayload
{
   /// <summary>
   /// Unique ID of stock or index
   /// </summary>
   public string StockID { get; set; }

   /// <summary>
   /// Current value of stock or index
   /// </summary>
   public double Value { get; set; }
}

The data members can only be basic types such as numbers, strings, etc. This is probably because the serialization mechanism in StreamInsight requires a fixed-size payload.

The initial rows of the input adapter are shown below.


Note that the input adapter inherits from the generic class TypedPointInputAdapter, which is used for typed (single kind of event) point (point-in-time event) adapters. Basically the adapter has to implement the Start() and Resume() methods, monitor the AdapterState and enqueue events to the output stream. An input adapter can work with asynchronous sources, through callback functions, as well as with synchronous sources. I've implemented the input adapter as a loop that reads data from Yahoo Finance and then waits for a set interval. A CTI event is sent after each quote, to indicate that no earlier events are still pending.
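
Since the screenshot only shows the initial rows, here is a minimal sketch of what the event-producing loop inside the adapter can look like. The stockSymbol and interval fields and the YahooFinance.GetQuote call are assumptions for illustration; the base class members follow the typed point input adapter API:

// Sketch: polling loop inside the input adapter class (uses System.Threading)
private void ProduceEvents()
{
   while (AdapterState == AdapterState.Running)
   {
      // Read the current quote (hypothetical helper call)
      double value = YahooFinance.GetQuote(stockSymbol);

      // Create and enqueue a point INSERT event carrying the payload
      PointEvent<StockPayload> ev = CreateInsertEvent();
      if (ev == null)
         break;   // the engine is suspending or stopping the adapter
      ev.StartTime = DateTimeOffset.Now;
      ev.Payload = new StockPayload { StockID = stockSymbol, Value = value };
      Enqueue(ref ev);

      // Send a CTI event to signal that no earlier events are pending
      EnqueueCtiEvent(DateTimeOffset.Now);

      // Wait before polling the next quote
      Thread.Sleep(interval);
   }

   if (AdapterState == AdapterState.Stopping)
      Stopped();
}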

A factory class is provided to instantiate the adapters. The factory implements a Create method that takes a configuration and an event shape as parameters, and based on these it instantiates the input adapter.
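
A minimal sketch of such a factory, assuming the typed input adapter factory interface of the November CTP (StockAdapterConfig and StockInputAdapter are the sample's class names, assumed here):

public class StockInputAdapterFactory : ITypedInputAdapterFactory<StockAdapterConfig>
{
   public InputAdapterBase Create<TPayload>(StockAdapterConfig configInfo,
      EventShape eventShape)
   {
      // This sample adapter only produces point events
      if (eventShape == EventShape.Point)
         return new StockInputAdapter(configInfo);

      throw new ArgumentException("Only the point event shape is supported");
   }

   public void Dispose()
   {
   }
}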

The output adapter is built in a similar way to the input adapter, so I won't describe it here. The adapters are tied together by a query defined by the following statements. Note that this solution is specific to the implicit development model. In the explicit model, the query would instead be bound through the CreateQuery method of the application.

// Join input adapters with a simple LINQ query
var combinedInputStream = (from e in input1Stream
                           select e).Union(
                           from e in input2Stream
                           select e);

// Connect input adapters with output adapter
var query = combinedInputStream.ToQuery(outputSink);
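
For completeness, input1Stream and input2Stream are created from the adapter factory. In the implicit model this is done with CepStream<T>.Create, roughly as sketched below (the stream name and the StockAdapterConfig properties are illustrative):

// Sketch: binding a point event stream to the input adapter factory
var config1 = new StockAdapterConfig { Symbol = "MSFT", IntervalMs = 10000 };
var input1Stream = CepStream<StockPayload>.Create("quotes1",
   typeof(StockInputAdapterFactory), config1, EventShape.Point);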

Finally, the query is run:

// Run the query
query.Start();
adapterStopSignal.WaitOne();
query.Stop();

Testing and debugging

SQL Server 2008 R2 StreamInsight comes with a tool, the StreamInsight Debugger, that can visualize a recorded event flow. To use it you first need to write tracing data to a file: enable tracing, run your application, and then disable tracing. This is done from the command line as shown below:

Opening the trace data file, sample.etl, in StreamInsight Debugger you can monitor the stream pipeline. The picture below shows data from the input adapter instances and how they are unioned in the LINQ query.


There is even an option in the StreamInsight Debugger to replay the sequence so you can watch it like a movie. To do that, click on the clock icon in the toolbar.

Summary

Working with near real-time data is very different from traditional database development: data warehouses, for example, are typically batch-loaded and unfit for real-time data, so a new approach is required. StreamInsight looks promising as a technology for monitoring and responding to new data sources such as production systems, Facebook, Twitter and blogs.

Source code

The source code for my sample application, including the input adapter for Yahoo Finance, is available at the following link:

Unleashing the power of SQL Server ENTERPRISE EDITION - Part 1

Background

Having seen a number of MS SQL Server Business Intelligence implementations, I have noticed that surprisingly few consultants know the difference between SQL Server Enterprise Edition and SQL Server Standard Edition, and even fewer use the Enterprise Edition features. I think this is partly because of a lack of good examples, so I decided to write at least one.

Below is an extract from a comparison sheet between the Standard and Enterprise editions:

This article will focus on the usage of semi-additive measures.

The Inventory problem

The Inventory problem is a typical example of a semi-additive measure. Imagine that you have an ERP-system that tracks inventory, recording every time an item moves in or out of the inventory. From the ERP-system you get a daily net-change for each item, as shown below:

However, what the management is interested in is not the net changes, but the quantity available at the end of every day:

When looking at a specific day, month or other period, the quantity available is the sum of all net changes from the beginning of time to the end of that period. This is a closing balance. For example, if an item has net changes of +5, -2 and +3 on three consecutive days, the quantities available at the end of those days are 5, 3 and 6.

A semi-additive measure is one that behaves differently along different dimensions. A closing balance is semi-additive because it behaves differently in the time-dimension than in other dimensions. Other examples of semi-additive measures are: opening balance, average and by account.

An example - Balance Sheet

Another situation, apart from inventories, where you want to work with opening and closing balances is Balance Sheets. When extracting general ledger data from an ERP-system you get the net transactions. However, financial people are used to seeing opening balances, net changes and closing balances. This example will show how to produce closing balances; opening balances can be produced with small modifications to the example.

Consider the following star-schema for describing the Balance Sheet:

The fact-table is filled with transactional data (not necessarily in time-order):

The Day_Key is expressed in ISO-format (YYYYMMDD), and the Month_Key and Year_Key are made up by simply removing the lower parts: for example, Day_Key 20090915 belongs to Month_Key 200909 and Year_Key 2009.

Dim_Account is a very simplified chart of accounts (based on EU-standards):

Natural keys are used in the relations between Fact_GL and the dimension tables, whereas in the real world surrogate keys should be used instead.

SQL-scripts for setting up this demo environment can be found here:

The SQL-scripts assume that you have created a database named PRECALCDB that you use to run them.

Calculating the closing balances

Calculating the closing balances can be done in two ways: calculating them dynamically in the OLAP Cube, or precalculating them in the relational database. We will try both in this example.

To precalculate the closing balances in the relational database, we create a new table to hold the calculated values, Fact_GL_CB. It is similar to the previous fact-table but contains the field Amount_CB instead of Amount, and also has an extra Date-field (not really necessary, but it simplifies things a bit):
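
The table definition itself is not reproduced here, but judging from the INSERT statement further below it looks roughly like this (a sketch; the data types are assumptions):

-- Sketch of Fact_GL_CB; column names match the INSERT below, data types assumed
CREATE TABLE Fact_GL_CB
(
   Company_Key NVARCHAR(10) NOT NULL,
   Account_Key NVARCHAR(10) NOT NULL,
   Day_Key INT NOT NULL,
   [Date] DATETIME NOT NULL,
   Amount_CB MONEY NOT NULL
);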

We now need to fill Fact_GL_CB with closing balances. It should have a value for every combination of Company, Account and Day_Key, equal to the sum of all amounts having a date less than or equal to the Day_Key. This could be accomplished using while-loops, cursors, etc. However, I prefer writing a recursive SQL-query instead. My solution works in a few steps:

First, initialise a range-table, telling the recursive query which dates to handle:

TRUNCATE TABLE Fact_GL_CB;

SELECT DISTINCT
   Company_Key,
   Account_Key,
   (SELECT MIN([Date]) FROM Fact_GL) AS [Date],
   (SELECT MAX([Date]) FROM Fact_GL) AS [MaxDate]
INTO #Range
FROM Fact_GL
WHERE Account_Key < N'3';

Note that this step also deletes any previous data from Fact_GL_CB and filters out transactions not belonging to the Balance Sheet (those having an Account_Key starting with 3 or higher). Using the name #Range ensures the table is written to tempdb instead of PRECALCDB.

The next step is to create a daily sums table, holding the sum of all transactions for each day:

WITH
AllDays AS
(
   SELECT
      Company_Key,
      Account_Key,
      [Date],
      [MaxDate]
   FROM #Range

   UNION ALL

   SELECT
      Company_Key,
      Account_Key,
      DATEADD(DAY, 1, [Date]),
      [MaxDate]
   FROM AllDays
   WHERE [Date] < [MaxDate]
)
SELECT
   AllDays.Company_Key,
   AllDays.Account_Key,
   AllDays.[Date],
   ISNULL(SUM(Amount), 0) [Amount]
INTO #DailySums
FROM AllDays
LEFT JOIN Fact_GL ON AllDays.Company_Key = Fact_GL.Company_Key AND AllDays.Account_Key = Fact_GL.Account_Key AND AllDays.[Date] = Fact_GL.[Date]
GROUP BY AllDays.Company_Key, AllDays.Account_Key, AllDays.[Date]
OPTION (MAXRECURSION 0);

Note the MAXRECURSION option, which is necessary to prevent the AllDays common table expression from hitting the default recursion depth limit of 100.

With the daily sums calculated, the closing balances can be produced by a recursive query that accumulates the daily sums:

WITH
Accumulator AS
(
   SELECT
      Company_Key,
      Account_Key,
      [Date],
      [Amount]
   FROM #DailySums
   WHERE [Date] = (SELECT MIN([Date]) FROM #Range)

   UNION ALL

   SELECT
      #DailySums.Company_Key,
      #DailySums.Account_Key,
      #DailySums.[Date],
      #DailySums.[Amount] + Accumulator.[Amount]
   FROM #DailySums
   JOIN Accumulator
   ON
      Accumulator.Company_Key = #DailySums.Company_Key AND
      Accumulator.Account_Key = #DailySums.Account_Key AND
      DATEADD(DAY, 1, Accumulator.[Date]) = #DailySums.[Date]
)
INSERT INTO Fact_GL_CB(Amount_CB, [Date], Day_Key, Company_Key, Account_Key)
SELECT
   [Amount],
   [Date],
   CONVERT(INT, CONVERT(NVARCHAR, [Date], 112)) [Day_Key],
   Company_Key,
   Account_Key
FROM Accumulator
ORDER BY Company_Key, Account_Key, [Date]
OPTION (MAXRECURSION 0);

Finally, the performance can be improved by adding an index on #DailySums between steps 2 and 3.
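
For instance, a clustered index matching the join columns of the accumulator query should do (a sketch; the exact indexing choice is up to you):

-- Index supporting the joins on Company_Key, Account_Key and [Date]
CREATE CLUSTERED INDEX IX_DailySums
   ON #DailySums (Company_Key, Account_Key, [Date]);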

The complete SQL-script for the aggregation process can be found here:

Building the cube

Now we are finished with the relational database part and continue with the cube. We start an Analysis Services project in BIDS and create a Data Source View as below:


The next step is to create the cube and dimensions. This is most easily done by running the New Cube wizard. Make sure you select both Fact_GL and Fact_GL_CB as measure group tables, and select the dimensions. Be prepared to make some adjustments to the dimensions afterwards (the wizard isn't too clever).

We also need to make some minor adjustments to the measures. Change the format so that they are displayed as Currency or similar. Rename "Amount CB" to "Amount CB Precalc" and change its aggregation function to LastNonEmpty:


LastNonEmpty is basically the equivalent of a closing balance. Another option would be LastChild, which differs only in the handling of missing values: LastChild returns the last child member's value even when it is empty, whereas LastNonEmpty returns the last child that actually contains data.

We also add a calculated measure, "Amount CB Cube", which dynamically calculates the closing balance in the cube (with no need for the Fact_GL_CB table). This measure can easily be created as follows:

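The screenshot is not reproduced here, but the calculation is essentially a running sum over time. At the day level it can be sketched in MDX like this (assuming the Time dimension's Day attribute used in the query below; the actual definition in the screenshot may differ):

CREATE MEMBER CURRENTCUBE.[Measures].[Amount CB Cube] AS
   SUM(NULL : [Time].[Day].CurrentMember, [Measures].[Amount]),
   FORMAT_STRING = "Currency";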

The full Analysis Services-project can be found here:

Build and deploy the solution. Time to test...

Results

To evaluate the results, I have used MDX Studio by Mosha Pasumansky. This is an essential tool that any serious Microsoft OLAP-developer should have.

The following MDX-query can be used for testing. Swap [Amount CB Cube] for [Amount CB Precalc] to test the precalculated values instead of the dynamically calculated ones.

SELECT
   {Account.Assets.CHILDREN, Account.Liabilities.CHILDREN} *
   {[Amount], [Amount CB Cube]} ON COLUMNS,
   [Time].Day.MEMBERS ON ROWS
FROM PRECALCDB

Using MDX Studio to profile this query gives us the following performance for dynamically calculated closing balances:


Testing the precalculated closing balances gives us this performance:


As you can see, the precalculated closing balances are about 10 times faster when the cache is cold. With a warm cache the difference is insignificant, as expected, since no calculations need to be repeated once the results are cached.

However, the performance is not the only reason for using precalculations. It may not even be the most important one. Calculating the closing balance dynamically in the cube (using a calculated measure) has several disadvantages, for example:

  • Drill-through actions will not work
  • Problems may arise with solve-order

Therefore the Enterprise Edition feature, semi-additive measures, is much preferred when possible, especially if you already have closing balances or opening balances in the data source.

Source code

Tired of getting unwanted text formatting when you copy/paste?

Try my new application PlainText. Download it from here.

Using PlainText you can copy text from any application and paste it as unformatted text with a simple keyboard shortcut, just as if you had pasted it into Notepad and copied it from there.

This is a feature I have been missing in Windows, so I decided to finally do something about it!

Posted: Sep 03 2009, 10:11 PM by jahlen | with 3 comment(s)
Johan Åhlén starts blogging

It seems that everyone else has a blog, so it is time for me to start one too. Some of you know me already - the rest of you are wondering who I am.

I am a self-employed Business Intelligence professional living in Sweden (the Malmö area). I am an MCITP on Microsoft SQL Server 2008 and 2005, although I actually do a fair amount of everything Microsoft and .NET, and I typically work in the borderland between IT and Business. I was one of the founders of Entreprenörsdagen. I have also founded a 3D-mouse company and a few other businesses, which has given me the opportunity to try a number of different business roles.
 
I live with my wonderful wife and our 1.5-year-old daughter. When not working I enjoy playing my grand piano (which formerly belonged to the famous Swedish musician Östen Warnerbring), reading Robert Jordan's books and renovating my summer house.

In this Blog I will mostly write technical articles and thoughts about Microsoft SQL Server, Microsoft Oslo and Business Intelligence-related technologies. However, feel free to discuss anything with me.

Hope to see you soon again!

Posted: Sep 03 2009, 03:35 PM by jahlen