Leon's Weblog

November 16, 2013

Document Indexer Library for .NET

Filed under: Software Dev — Leon @ 10:09 pm

I recently wrote about using the Full-Text Search feature built into SQL Server to let users search through documents (and the challenge of displaying a summary of the search results). Configuring Full-Text Search was a fairly easy process; however, populating the table containing the data to be searched turned out to be a bit trickier. I wanted to avoid the overhead of SQL Server FILESTREAM and FileTables, and I needed control over the text that was extracted from documents. My only option was to implement a custom indexer to extract the text I wanted from files and then store the text in the database.

My first attempt at extracting text from documents used iFilters. This is the interface that Windows Search uses to index files, and SQL Server uses it as well to search through FileTables. I liked the universality of this approach because any file type registered on the server would be parse-able without requiring file-type-specific code. After hours of browsing PInvoke.net and looking at working projects online, I finally got a library that compiled and parsed all of the file types that I needed. Unfortunately, I had to give up this approach because of several limitations. First, the code was just not stable enough for my liking. It required reading the registry, loading COM objects, and a great deal of unmanaged code. As a result, the library was hard to set up on various servers due to issues with 32-bit vs. 64-bit iFilter libraries. Furthermore, because of the unmanaged code, I could not invoke the library from a web page and ended up implementing a Windows Service which indexed the documents offline. The final straw was that Adobe's iFilter parser for PDF files sucks: it parses the entire document into one string with no way to discern pages, lines, or sections of the document. It was time to try something else.

My second attempt started out a bit more ambitious but ended up being much easier to implement. To have control over how the text was extracted from files, I would have to index the files myself. I reviewed the types of documents that I wanted to support and limited the library to PDF files, Microsoft Office documents, and plain text files. Text files are obviously easy to read in .NET, so I started simple. Using the TextReader class (the base class for both StreamReader and StringReader) gave me some uniformity when reading chunks from a text file or from a string extracted from a formatted document. I decided to give my class an interface similar to TextReader's: a constructor that opens the file and a Read method that buffers the text in blocks until the end of the file is reached. A rough sketch of that interface follows.
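To make the interface concrete, here is a sketch of the class covering only the plain-text case. The names DocumentIndexer, DEFAULT_BLOCK_SIZE, Read, and Dispose come from the usage example later in this post; the internals shown here are illustrative, and the full implementation (including the PDF and Office logic) is in the DocumentIndexer.vb download linked at the end of this article.

Imports System.IO

'Sketch of the indexer interface (plain-text case only)
Public Class DocumentIndexer
    Implements IDisposable

    'Assumed default; 8040 bytes is the optimal block size for SQL Server .WRITE updates
    Public Const DEFAULT_BLOCK_SIZE As Integer = 8040

    Private _Reader As TextReader

    'The constructor opens the file (the real class picks a parser based on the file type;
    'BlockSize is only relevant when a formatted document has to be buffered)
    Public Sub New(ByVal FileName As String, ByVal BlockSize As Integer)
        _Reader = New StreamReader(FileName)
    End Sub

    'Fills the buffer with the next block of text and returns the number of
    'characters read (0 once the end of the document is reached)
    Public Function Read(ByVal buffer() As Char) As Integer
        Return _Reader.Read(buffer, 0, buffer.Length)
    End Function

    Public Sub Dispose() Implements IDisposable.Dispose
        _Reader.Dispose()
    End Sub
End Class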

With the easy part out of the way, I set out to read PDF files. Luckily, I found the iTextSharp library, which did all of the heavy lifting. Using this library, reading a PDF document ended up being as simple as:

'Requires the iTextSharp library
Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Dim sb As New StringBuilder

Using PDFDoc As New PdfReader(_FileName)
    'Extract the text one page at a time so page breaks are preserved
    For i As Integer = 1 To PDFDoc.NumberOfPages
        Dim objExtractStrategy As New SimpleTextExtractionStrategy()
        Dim pageText As String = PdfTextExtractor.GetTextFromPage(PDFDoc, i, objExtractStrategy)
        sb.Append(pageText & vbCrLf)
    Next
End Using

My next break came from realizing that Microsoft's newer file format (Office Open XML) is simply a compressed folder of XML files. Microsoft even releases an SDK with sample code for Open XML, which makes parsing Office documents almost trivial. I did have to write different logic for parsing DOCX, XLSX, and PPTX files, but there is not much to it. The trick with Word files was to make sure to get the latest version of the text inside the document for files with revision tracking enabled. I also found it was important to extract each cell of a table embedded inside a Word document individually so that the content did not run together. The trick with Excel files is that strings are stored in a shared look-up table, and failing to perform the look-up results in an extract full of meaningless numbers. PowerPoint files are the most straightforward to read: simply extract the text one slide at a time. A link with the complete code is provided at the end of this article.
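As an illustration of the Excel shared-string look-up, here is a rough sketch using the Open XML SDK. It is not the exact code from my library (the full version is in the download below), and it assumes the workbook actually contains a shared string table.

'Requires the Open XML SDK (DocumentFormat.OpenXml)
Imports System.Linq
Imports System.Text
Imports DocumentFormat.OpenXml.Packaging
Imports DocumentFormat.OpenXml.Spreadsheet

Dim sb As New StringBuilder

Using XLDoc As SpreadsheetDocument = SpreadsheetDocument.Open(_FileName, False)
    'Strings are kept in a shared table; cells only store an index into it
    Dim sharedStrings As SharedStringTable = XLDoc.WorkbookPart.SharedStringTablePart.SharedStringTable

    For Each sheetPart As WorksheetPart In XLDoc.WorkbookPart.WorksheetParts
        For Each cell As Cell In sheetPart.Worksheet.Descendants(Of Cell)()
            If cell.CellValue Is Nothing Then Continue For

            If cell.DataType IsNot Nothing AndAlso cell.DataType.Value = CellValues.SharedString Then
                'Shared string: look up the actual text by index
                Dim index As Integer = Integer.Parse(cell.CellValue.Text)
                sb.Append(sharedStrings.ElementAt(index).InnerText & " ")
            Else
                'Numbers (and inline values) can be read directly
                sb.Append(cell.CellValue.Text & " ")
            End If
        Next
    Next
End Using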

Below is an example of how to use the Document Indexer. It reads blocks of text from a file and inserts the text into the SQL Server table. Note that a parameterized SQL command is used to avoid any complications (e.g. security concerns) with problematic file content; think of what would happen if a single quote in the file were concatenated directly into the SQL string. The .WRITE clause is used on the SQL Server side to append blocks of text to the field, which is the preferred way to update VARCHAR(MAX) columns. The documentation for this clause indicates that 8040 bytes is the optimal block size.

'Generates the file index and stores it in the database
Dim Indexer As New DocumentIndexer(FilePath, DocumentIndexer.DEFAULT_BLOCK_SIZE)
Dim buffer(DocumentIndexer.DEFAULT_BLOCK_SIZE - 1) As Char, readCount As Integer = DocumentIndexer.DEFAULT_BLOCK_SIZE
Dim SQL_Clear As String = "UPDATE tblFiles SET FileIndex='' WHERE FileID=" & FileID
Dim SQL_Data As String = "UPDATE tblFiles SET FileIndex.WRITE(@Data, NULL, NULL) WHERE FileID=" & FileID

Try
    'Clear existing index
    If ClearIndex Then DButils.RunSQLCommand(SQL_Clear, True)

    'Read blocks of text from the indexer and append them to the database field
    While readCount > 0
        readCount = Indexer.Read(buffer)
        If readCount > 0 Then
            Dim Data As New System.Data.SqlClient.SqlParameter("@Data", SqlDbType.VarChar, readCount)
            Data.Value = New String(buffer, 0, readCount)
            DButils.RunSQLCommand(SQL_Data, Data)
        End If
    End While

Catch ex As Exception
    Throw 'Rethrow without resetting the stack trace
Finally
    Indexer.Dispose()
End Try

As a side benefit, I found that indexing most formatted documents and storing only the plain text reduced the storage requirement by a factor of 10 (even without compression). That's all there is to it.

Code
The complete logic for the Document Indexer can be found below:
• DocumentIndexer.vb
