Leon's Weblog

October 27, 2013

SQL Server Full Text Summary

Filed under: Software Dev — Leon @ 11:23 am

Text Search Microsoft SQL Server has a convenient Full-Text search capability which is powerful and fairly easy to setup. There is, however, one glaring feature which appears to be missing. SQL Server can rank the search results but it has no built in way to summarize them. So when you store large documents in a table, you can easily tell a user which of these documents match a particular search string but you can’t easily tell the user where to look inside those document. Imagine using an internet search engine and getting a list of links without a summary of those pages.

I figured that I can’t be the first to need this functionality and expected to find plenty of solutions online. One solution that I thought looked promising is the Hit Highlight user defined function. It is easy to setup but I didn’t like the resulting summary or the performance hit (the full-text search runs faster than the summary). Another solution that I found was the ThinkHighlight function which improves performance a bit by implementing the code in a CLR Assembly but it is not free. This site compared the performance of these two solutions and it didn’t seem like the performance improvement of the CLR Assembly justified the cost. Since I didn’t find the solutions that I was looking for, I set out to make one of my own.

I wanted a generic function that would work on the SQL Server and can be used to summarize a large text field based on a search string. I would call this function on the full-text search results table. The function has three main sections. First, we need to split the search string into terms. This part is actually not trivial because SQL Server performs a semantic and thesaurus look-up when performing a full-text search and we would want to do the same otherwise we may not find why a particular record was identified by the search. The second step is finding the search terms in the document. Finally, the last step is to find the biggest cluster of search terms in the document and return that part of the document. I did not want to modify (e.g. add markup) to the resulting document summary because I think that is best done as part of the interface logic not in the database.

Splitting the search string into terms can be done using the built in sys.dm_fts_parser function. Unfortunately this function requires sysadmin privileges to execute. Users have long complained about this limitation to Microsoft but no changes were made. Fortunately, there is a workaround. It involves creating a user defined function that wraps sys.dm_fts_parser and signing it with a certificate. While a PITA, it is better than granting sysadmin privileges to your application logic. I created a user defined function instead of a stored procedure that the workaround uses because I though it is easier to use; however, the rest of the logic applies the same way. The user defined function looks like this:

	CREATE FUNCTION [dbo].[fts_parser] (
		@query nvarchar(max)
	)
	RETURNS @Parser TABLE (
		keyword varbinary(128),
		special_term nvarchar(4000),
		display_term nvarchar(4000)
	)
	WITH EXECUTE AS CALLER
	AS
	BEGIN
		INSERT INTO @Parser (keyword, special_term, display_term)
		SELECT keyword, special_term, display_term
		FROM sys.dm_fts_parser(@query, 1033, 0, 0);
	
		RETURN
	END

The next step is to find each position of each search term in the document. Unfortunately, I had to implement it with cursors. The logic is pretty straight forward.

	--get list of terms
	DECLARE cur1 CURSOR FOR
	SELECT display_term
	FROM dbo.fts_parser(N'FORMSOF(FREETEXT, "' + @searchTerm + '")')
	WHERE special_term='Exact Match'

	--find position of each term in contents
	OPEN cur1
	FETCH NEXT FROM cur1 INTO @term

	WHILE @@FETCH_STATUS = 0
	BEGIN
		WHILE 1 = 1
		BEGIN
			SET @p = CHARINDEX(@term, @contents, @p)
			IF @p <= 0 BREAK
	
			INSERT INTO @lb1 VALUES (@p)
	
			SET @p = @p + LEN(@term)
		END
    
		FETCH NEXT FROM cur1 INTO @term   
	END   

	CLOSE cur1
	DEALLOCATE cur1

The last step also uses cursors (such is life). The idea is to return a 200 character summary that is most representative of the search. In my case, that meant that the summary would contain the most search terms. My implementation is not the most elegant algorithm for finding clusters in arrays (likely the least efficient brute-force approach) but it is hard to do much better within the limitation of T-SQL. Also, the logic can be improved to separate the summary on word boundaries instead of at arbitrary positions in the string but I preferred to implement that logic with the interface code. Ultimately, this is just and example and you will likely want to modify it to suit your needs.

	--find term cluster
	DECLARE @lb2 TABLE (pos int, freq int)

	INSERT INTO @lb2
	SELECT pos, 0 FROM @lb1	GROUP BY pos

	DECLARE cur2 CURSOR FOR
	SELECT pos FROM @lb2

	OPEN cur2
	FETCH NEXT FROM cur2 INTO @p

	WHILE @@FETCH_STATUS = 0
	BEGIN
		UPDATE @lb2
		SET freq = (SELECT COUNT(*) FROM @lb2 WHERE ABS(pos-@p)<100)
		WHERE pos=@p
    
		FETCH NEXT FROM cur2 INTO @p   
	END   

	CLOSE cur2
	DEALLOCATE cur2

	--return 200 character string based on term cluster center
	DECLARE @ClusterCenter INT
	SET @ClusterCenter = (
		SELECT TOP 1 pos FROM @lb2 
		WHERE freq = (SELECT MAX(freq) FROM @lb2)
		ORDER BY pos)
	IF @ClusterCenter<100
		SET @ClusterCenter = 100

	RETURN SUBSTRING(@contents, @ClusterCenter-100, 200)

Code
The complete logic for the Full-Text Search Summary function can be found below:
SummarizeFileIndex.sql

No Comments »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a comment