SQL Server Full Text Summary
Microsoft SQL Server has a convenient Full-Text search capability which is powerful and fairly easy to setup. There is, however, one glaring feature which appears to be missing. SQL Server can rank the search results but it has no built in way to summarize them. So when you store large documents in a table, you can easily tell a user which of these documents match a particular search string but you can’t easily tell the user where to look inside those document. Imagine using an internet search engine and getting a list of links without a summary of those pages.
I figured that I can’t be the first to need this functionality and expected to find plenty of solutions online. One solution that I thought looked promising is the Hit Highlight user defined function. It is easy to setup but I didn’t like the resulting summary or the performance hit (the full-text search runs faster than the summary). Another solution that I found was the ThinkHighlight function which improves performance a bit by implementing the code in a CLR Assembly but it is not free. This site compared the performance of these two solutions and it didn’t seem like the performance improvement of the CLR Assembly justified the cost. Since I didn’t find the solutions that I was looking for, I set out to make one of my own.
I wanted a generic function that would work on the SQL Server and can be used to summarize a large text field based on a search string. I would call this function on the full-text search results table. The function has three main sections. First, we need to split the search string into terms. This part is actually not trivial because SQL Server performs a semantic and thesaurus look-up when performing a full-text search and we would want to do the same otherwise we may not find why a particular record was identified by the search. The second step is finding the search terms in the document. Finally, the last step is to find the biggest cluster of search terms in the document and return that part of the document. I did not want to modify (e.g. add markup) to the resulting document summary because I think that is best done as part of the interface logic not in the database.
Splitting the search string into terms can be done using the built in sys.dm_fts_parser
function. Unfortunately this function requires sysadmin
privileges to execute. Users have long complained about this limitation to Microsoft but no changes were made. Fortunately, there is a workaround. It involves creating a user defined function that wraps sys.dm_fts_parser
and signing it with a certificate. While a PITA, it is better than granting sysadmin
privileges to your application logic. I created a user defined function instead of a stored procedure that the workaround uses because I though it is easier to use; however, the rest of the logic applies the same way. The user defined function looks like this:
CREATE FUNCTION [dbo].[fts_parser] ( @query nvarchar(max) ) RETURNS @Parser TABLE ( keyword varbinary(128), special_term nvarchar(4000), display_term nvarchar(4000) ) WITH EXECUTE AS CALLER AS BEGIN INSERT INTO @Parser (keyword, special_term, display_term) SELECT keyword, special_term, display_term FROM sys.dm_fts_parser(@query, 1033, 0, 0); RETURN END
The next step is to find each position of each search term in the document. Unfortunately, I had to implement it with cursors. The logic is pretty straight forward.
--get list of terms DECLARE cur1 CURSOR FOR SELECT display_term FROM dbo.fts_parser(N'FORMSOF(FREETEXT, "' + @searchTerm + '")') WHERE special_term='Exact Match' --find position of each term in contents OPEN cur1 FETCH NEXT FROM cur1 INTO @term WHILE @@FETCH_STATUS = 0 BEGIN WHILE 1 = 1 BEGIN SET @p = CHARINDEX(@term, @contents, @p) IF @p <= 0 BREAK INSERT INTO @lb1 VALUES (@p) SET @p = @p + LEN(@term) END FETCH NEXT FROM cur1 INTO @term END CLOSE cur1 DEALLOCATE cur1
The last step also uses cursors (such is life). The idea is to return a 200 character summary that is most representative of the search. In my case, that meant that the summary would contain the most search terms. My implementation is not the most elegant algorithm for finding clusters in arrays (likely the least efficient brute-force approach) but it is hard to do much better within the limitation of T-SQL. Also, the logic can be improved to separate the summary on word boundaries instead of at arbitrary positions in the string but I preferred to implement that logic with the interface code. Ultimately, this is just and example and you will likely want to modify it to suit your needs.
--find term cluster DECLARE @lb2 TABLE (pos int, freq int) INSERT INTO @lb2 SELECT pos, 0 FROM @lb1 GROUP BY pos DECLARE cur2 CURSOR FOR SELECT pos FROM @lb2 OPEN cur2 FETCH NEXT FROM cur2 INTO @p WHILE @@FETCH_STATUS = 0 BEGIN UPDATE @lb2 SET freq = (SELECT COUNT(*) FROM @lb2 WHERE ABS(pos-@p)<100) WHERE pos=@p FETCH NEXT FROM cur2 INTO @p END CLOSE cur2 DEALLOCATE cur2 --return 200 character string based on term cluster center DECLARE @ClusterCenter INT SET @ClusterCenter = ( SELECT TOP 1 pos FROM @lb2 WHERE freq = (SELECT MAX(freq) FROM @lb2) ORDER BY pos) IF @ClusterCenter<100 SET @ClusterCenter = 100 RETURN SUBSTRING(@contents, @ClusterCenter-100, 200)
Code
The complete logic for the Full-Text Search Summary function can be found below:
• SummarizeFileIndex.sql