I recently wrote about using the Full-Text search feature built into SQL Server to allow users to search through documents (and the challenge of displaying a summary of the search results). Configuring Full-Text search was a fairly easy process; however, populating the table containing the data to be searched turned out to be a bit trickier. I wanted to avoid the overhead of using SQL Server FileStream and FileTables, and I needed control over the text that was extracted from documents. My only option was to implement a custom indexer to extract the text that I wanted from files and then store the text in the database.
My first attempt at extracting text from documents was using iFilters. This is the interface that Windows Search uses to index files; SQL Server uses it as well to search through FileTables. I liked the universality of this approach because any file type registered on the server would be parseable without requiring file-type specific code. After hours of browsing PInvoke.net and looking at working projects online, I finally got a library that compiled and parsed all of the file types that I needed. Unfortunately, I had to give up this approach because of several limitations. First, the code was just not stable enough for my liking. It required reading the registry, loading COM objects, and involved a great deal of unmanaged code. As a result, the library was hard to set up on various servers due to issues with 32-bit vs 64-bit iFilter libraries. Furthermore, because of the unmanaged code, I could not invoke this library from a web page and ended up implementing a Windows Service which indexed the documents offline. The final straw was Adobe's iFilter for PDF files: it parses the entire document into one string with no way to discern pages, lines, or sections of the document. It was time to try something else.
My second attempt started out a bit more ambitious but ended up being much easier to implement. In order to have control over how the text was extracted from files, I would have to index the files myself. I reviewed the types of documents that I wanted to support and limited the library to PDF files, Microsoft Office documents, and plain text files. Text files are obviously easy to read using .NET, so I started simple. Using the TextReader class (which is the base class for both the StreamReader and StringReader classes) allowed me to create some uniformity when reading chunks from a text file or from a string extracted from a formatted document. I decided to give my class an interface similar to the one TextReader uses: a constructor that opens the file and a Read method that buffers the text in blocks until the end of the file is reached.
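A minimal sketch of that surface might look like the following. The class and constant names match the usage example later in this article; the dispatch to the PDF and Office readers is elided here, and the plain-text case simply wraps a StreamReader:

```vbnet
' Sketch of the indexer's TextReader-style surface. The real class also
' dispatches to PDF and Office extraction logic based on the file extension.
Public Class DocumentIndexer
    Implements IDisposable

    Public Const DEFAULT_BLOCK_SIZE As Integer = 8040
    Private ReadOnly _Reader As IO.TextReader

    Public Sub New(ByVal FileName As String, ByVal BlockSize As Integer)
        ' Plain text files can be read directly with a StreamReader;
        ' formatted documents are first extracted into a StringReader.
        _Reader = New IO.StreamReader(FileName)
    End Sub

    ' Fills the buffer and returns the number of characters read
    ' (0 once the end of the document is reached), like TextReader.Read.
    Public Function Read(ByVal buffer() As Char) As Integer
        Return _Reader.Read(buffer, 0, buffer.Length)
    End Function

    Public Sub Dispose() Implements IDisposable.Dispose
        _Reader.Dispose()
    End Sub
End Class
```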
With the easy part out of the way, I set out to read PDF files. Luckily, I found the iTextSharp library, which did all of the heavy lifting. Using this library, reading a PDF document ended up being as simple as:
Dim sb As New StringBuilder
Using PDFDoc As New PdfReader(_FileName)
    For i As Integer = 1 To PDFDoc.NumberOfPages
        Dim objExtractStrategy As New SimpleTextExtractionStrategy()
        Dim pageText As String = PdfTextExtractor.GetTextFromPage(PDFDoc, i, objExtractStrategy)
        sb.Append(pageText & vbCrLf)
    Next
End Using
My next break came from realizing that Microsoft's newer file format (Office Open XML) is simply a compressed folder of XML files. Microsoft even releases an SDK with sample code for Open XML, which makes parsing Office documents almost trivial. I did have to write different logic for parsing DOCX, XLSX, and PPTX files, but there is not much to it. The trick with Word files was to make sure to get the latest version of the text inside the document for files with revision tracking enabled. I also found it was important to extract each cell of a table embedded inside the Word document individually so that the content did not run together. The trick with Excel files is that strings are stored in a look-up table, and failing to perform the look-up results in an extract full of meaningless numbers. PowerPoint files are the most straightforward to read: simply extract the text one slide at a time. A link to the complete code is provided at the end of this article.
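To give a sense of how little code the SDK requires, below is a simplified sketch of DOCX extraction using the Open XML SDK's WordprocessingDocument class. Unlike the complete code, this version does not handle tracked revisions or walk table cells individually; the second helper illustrates the shared-string look-up needed for Excel files:

```vbnet
Imports DocumentFormat.OpenXml.Packaging
Imports DocumentFormat.OpenXml.Wordprocessing

Module OpenXmlSamples

    ' Reads the text of a DOCX file one paragraph at a time.
    Public Function ExtractDocxText(ByVal FileName As String) As String
        Dim sb As New System.Text.StringBuilder
        Using doc As WordprocessingDocument = WordprocessingDocument.Open(FileName, False)
            For Each para As Paragraph In doc.MainDocumentPart.Document.Body.Descendants(Of Paragraph)()
                sb.AppendLine(para.InnerText)
            Next
        End Using
        Return sb.ToString()
    End Function

    ' XLSX cells whose data type is SharedString store only an index into
    ' the workbook's shared string table; the index must be resolved to text.
    Public Function LookupSharedString(ByVal doc As SpreadsheetDocument, ByVal index As Integer) As String
        Return doc.WorkbookPart.SharedStringTablePart.SharedStringTable.ChildElements(index).InnerText
    End Function

End Module
```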
Below is a sample of how to use the Document Indexer. It reads blocks of text from the files and inserts the text into the SQL Server table. Note that a parameterized SQL command is used to avoid any complications (e.g. security concerns) with problematic file content; think of what would happen if a single quote in the file were passed directly into a concatenated SQL string. The .WRITE method is used on the SQL Server side to append blocks of text to the field, which is the preferred way to update VARCHAR(MAX) fields. The documentation for this method indicates that writing in chunks that are multiples of 8040 bytes is optimal.
'Generates file index and stores to DB
Dim Indexer As New DocumentIndexer(FilePath, DocumentIndexer.DEFAULT_BLOCK_SIZE)
Dim buffer(DocumentIndexer.DEFAULT_BLOCK_SIZE - 1) As Char, readCount As Integer = DocumentIndexer.DEFAULT_BLOCK_SIZE
'FileID is numeric; the file content itself is always passed as a parameter
Dim SQL_Clear As String = "UPDATE tblFiles SET FileIndex='' WHERE FileID=" & FileID
Dim SQL_Data As String = "UPDATE tblFiles SET FileIndex.WRITE(@Data, NULL, NULL) WHERE FileID=" & FileID
Try
    'Clear existing index
    If ClearIndex Then DButils.RunSQLCommand(SQL_Clear, True)
    'Read blocks of text from stream and append to database field
    While readCount > 0
        readCount = Indexer.Read(buffer)
        If readCount > 0 Then
            Dim Data As New System.Data.SqlClient.SqlParameter("@Data", SqlDbType.VarChar, readCount)
            Data.Value = New String(buffer, 0, readCount)
            DButils.RunSQLCommand(SQL_Data, Data)
        End If
    End While
Catch ex As Exception
    Throw 'Rethrow without resetting the stack trace
Finally
    Indexer.Dispose()
End Try
As a side benefit, I found that indexing most formatted documents and storing only the plain text reduced the storage requirement by a factor of 10 (even without compression). That's all there is to it.
The complete logic for the Document Indexer can be found below: