Indexing .PDF, .XLS, .DOC, .PPT using Lucene.NET

Question

Indexing .PDF, .XLS, .DOC, .PPT using Lucene.NET

I've heard of Lucene.Net and I've heard of Apache Tika. The question is - how do I index these documents using C# vs Java? I think the issue is that there is no .Net equivalent of Tika which extracts relevant text from these document types.

UPDATE - Feb 05 2011

Based on given responses, it seems that the is not currently a native .Net equivalent of Tika. 2 interesting projects were mentioned that are each interesting in their own right:

Xapian Project (http://xapian.org/) - An alternative to Lucene written in unmanaged code. The project claims to support "swig" which allows for C# bindings. Within the Xapian Project there is an out-of-the-box search engine called Omega. Omega uses a variety of open source components to extract text from various document types.
IKVM.NET (http://www.ikvm.net/) - Allows Java to be run from .Net. An example of using IKVM to run Tika can be found here.

Given the above 2 projects, I see a couple of options. To extract the text, I could either a) use the same components that Omega is using or b) use IKVM to run Tika. To me, option b) seems cleaner as there are only 2 dependencies.

The interesting part is that now there are several search engines that could probably be used from .Net. There is Xapian, Lucene.Net or even Lucene (using IKVM).

UPDATE - Feb 07 2011

Another answer came in recommending that I check out ifilters. As it turns out, this is what MS uses for windows search so Office ifilters are readily available. Also, there are some PDF ifilters out there. The downside is that they are implemented in unmanaged code, so COM interop is necessary to use them. I found the below code snippit on a DotLucene.NET archive (no longer an active project):

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Text;

namespace IFilter
{
    [Flags]
    public enum IFILTER_INIT : uint
    {
        NONE = 0,
        CANON_PARAGRAPHS = 1,
        HARD_LINE_BREAKS = 2,
        CANON_HYPHENS = 4,
        CANON_SPACES = 8,
        APPLY_INDEX_ATTRIBUTES = 16,
        APPLY_CRAWL_ATTRIBUTES = 256,
        APPLY_OTHER_ATTRIBUTES = 32,
        INDEXING_ONLY = 64,
        SEARCH_LINKS = 128,
        FILTER_OWNED_VALUE_OK = 512
    }

    public enum CHUNK_BREAKTYPE
    {
        CHUNK_NO_BREAK = 0,
        CHUNK_EOW = 1,
        CHUNK_EOS = 2,
        CHUNK_EOP = 3,
        CHUNK_EOC = 4
    }

    [Flags]
    public enum CHUNKSTATE
    {
        CHUNK_TEXT = 0x1,
        CHUNK_VALUE = 0x2,
        CHUNK_FILTER_OWNED_VALUE = 0x4
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct PROPSPEC
    {
        public uint ulKind;
        public uint propid;
        public IntPtr lpwstr;
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct FULLPROPSPEC
    {
        public Guid guidPropSet;
        public PROPSPEC psProperty;
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct STAT_CHUNK
    {
        public uint idChunk;
        [MarshalAs(UnmanagedType.U4)] public CHUNK_BREAKTYPE breakType;
        [MarshalAs(UnmanagedType.U4)] public CHUNKSTATE flags;
        public uint locale;
        [MarshalAs(UnmanagedType.Struct)] public FULLPROPSPEC attribute;
        public uint idChunkSource;
        public uint cwcStartSource;
        public uint cwcLenSource;
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct FILTERREGION
    {
        public uint idChunk;
        public uint cwcStart;
        public uint cwcExtent;
    }

    [ComImport]
    [Guid("89BCB740-6119-101A-BCB7-00DD010655AF")]
    [InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
    public interface IFilter
    {
        [PreserveSig]
        int Init([MarshalAs(UnmanagedType.U4)] IFILTER_INIT grfFlags, uint cAttributes, [MarshalAs(UnmanagedType.LPArray, SizeParamIndex=1)] FULLPROPSPEC[] aAttributes, ref uint pdwFlags);

        [PreserveSig]
        int GetChunk(out STAT_CHUNK pStat);

        [PreserveSig]
        int GetText(ref uint pcwcBuffer, [MarshalAs(UnmanagedType.LPWStr)] StringBuilder buffer);

        void GetValue(ref UIntPtr ppPropValue);
        void BindRegion([MarshalAs(UnmanagedType.Struct)] FILTERREGION origPos, ref Guid riid, ref UIntPtr ppunk);
    }

    [ComImport]
    [Guid("f07f3920-7b8c-11cf-9be8-00aa004b9986")]
    public class CFilter
    {
    }

    public class IFilterConstants
    {
        public const uint PID_STG_DIRECTORY = 0x00000002;
        public const uint PID_STG_CLASSID = 0x00000003;
        public const uint PID_STG_STORAGETYPE = 0x00000004;
        public const uint PID_STG_VOLUME_ID = 0x00000005;
        public const uint PID_STG_PARENT_WORKID = 0x00000006;
        public const uint PID_STG_SECONDARYSTORE = 0x00000007;
        public const uint PID_STG_FILEINDEX = 0x00000008;
        public const uint PID_STG_LASTCHANGEUSN = 0x00000009;
        public const uint PID_STG_NAME = 0x0000000a;
        public const uint PID_STG_PATH = 0x0000000b;
        public const uint PID_STG_SIZE = 0x0000000c;
        public const uint PID_STG_ATTRIBUTES = 0x0000000d;
        public const uint PID_STG_WRITETIME = 0x0000000e;
        public const uint PID_STG_CREATETIME = 0x0000000f;
        public const uint PID_STG_ACCESSTIME = 0x00000010;
        public const uint PID_STG_CHANGETIME = 0x00000011;
        public const uint PID_STG_CONTENTS = 0x00000013;
        public const uint PID_STG_SHORTNAME = 0x00000014;
        public const int FILTER_E_END_OF_CHUNKS = (unchecked((int) 0x80041700));
        public const int FILTER_E_NO_MORE_TEXT = (unchecked((int) 0x80041701));
        public const int FILTER_E_NO_MORE_VALUES = (unchecked((int) 0x80041702));
        public const int FILTER_E_NO_TEXT = (unchecked((int) 0x80041705));
        public const int FILTER_E_NO_VALUES = (unchecked((int) 0x80041706));
        public const int FILTER_S_LAST_TEXT = (unchecked((int) 0x00041709));
    }

    /// 
    /// IFilter return codes
    /// 
    public enum IFilterReturnCodes : uint
    {
        /// 
        /// Success
        /// 
        S_OK = 0,
        /// 
        /// The function was denied access to the filter file. 
        /// 
        E_ACCESSDENIED = 0x80070005,
        /// 
        /// The function encountered an invalid handle, probably due to a low-memory situation. 
        /// 
        E_HANDLE = 0x80070006,
        /// 
        /// The function received an invalid parameter.
        /// 
        E_INVALIDARG = 0x80070057,
        /// 
        /// Out of memory
        /// 
        E_OUTOFMEMORY = 0x8007000E,
        /// 
        /// Not implemented
        /// 
        E_NOTIMPL = 0x80004001,
        /// 
        /// Unknown error
        /// 
        E_FAIL = 0x80000008,
        /// 
        /// File not filtered due to password protection
        /// 
        FILTER_E_PASSWORD = 0x8004170B,
        /// 
        /// The document format is not recognised by the filter
        /// 
        FILTER_E_UNKNOWNFORMAT = 0x8004170C,
        /// 
        /// No text in current chunk
        /// 
        FILTER_E_NO_TEXT = 0x80041705,
        /// 
        /// No more chunks of text available in object
        /// 
        FILTER_E_END_OF_CHUNKS = 0x80041700,
        /// 
        /// No more text available in chunk
        /// 
        FILTER_E_NO_MORE_TEXT = 0x80041701,
        /// 
        /// No more property values available in chunk
        /// 
        FILTER_E_NO_MORE_VALUES = 0x80041702,
        /// 
        /// Unable to access object
        /// 
        FILTER_E_ACCESS = 0x80041703,
        /// 
        /// Moniker doesn't cover entire region
        /// 
        FILTER_W_MONIKER_CLIPPED = 0x00041704,
        /// 
        /// Unable to bind IFilter for embedded object
        /// 
        FILTER_E_EMBEDDING_UNAVAILABLE = 0x80041707,
        /// 
        /// Unable to bind IFilter for linked object
        /// 
        FILTER_E_LINK_UNAVAILABLE = 0x80041708,
        /// 
        /// This is the last text in the current chunk
        /// 
        FILTER_S_LAST_TEXT = 0x00041709,
        /// 
        /// This is the last value in the current chunk
        /// 
        FILTER_S_LAST_VALUES = 0x0004170A
    }

    /// 
    /// Convenience class which provides static methods to extract text from files using installed IFilters
    /// 
    public class DefaultParser
    {
        public DefaultParser()
        {
        }

        [DllImport("query.dll", CharSet = CharSet.Unicode)]
        private extern static int LoadIFilter(string pwcsPath, [MarshalAs(UnmanagedType.IUnknown)] object pUnkOuter, ref IFilter ppIUnk);

        private static IFilter loadIFilter(string filename)
        {
            object outer = null;
            IFilter filter = null;

            // Try to load the corresponding IFilter
            int resultLoad = LoadIFilter(filename,  outer, ref filter);
            if (resultLoad != (int) IFilterReturnCodes.S_OK)
            {
                return null;
            }
            return filter;
        }

        public static bool IsParseable(string filename)
        {
            return loadIFilter(filename) != null;
        }

        public static string Extract(string path)
        {
            StringBuilder sb = new StringBuilder();
            IFilter filter = null;

            try
            {
                filter = loadIFilter(path);

                if (filter == null)
                    return String.Empty;

                uint i = 0;
                STAT_CHUNK ps = new STAT_CHUNK();

                IFILTER_INIT iflags =
                    IFILTER_INIT.CANON_HYPHENS |
                    IFILTER_INIT.CANON_PARAGRAPHS |
                    IFILTER_INIT.CANON_SPACES |
                    IFILTER_INIT.APPLY_CRAWL_ATTRIBUTES |
                    IFILTER_INIT.APPLY_INDEX_ATTRIBUTES |
                    IFILTER_INIT.APPLY_OTHER_ATTRIBUTES |
                    IFILTER_INIT.HARD_LINE_BREAKS |
                    IFILTER_INIT.SEARCH_LINKS |
                    IFILTER_INIT.FILTER_OWNED_VALUE_OK;

                if (filter.Init(iflags, 0, null, ref i) != (int) IFilterReturnCodes.S_OK)
                    throw new Exception("Problem initializing an IFilter for:\n" + path + " \n\n");

                while (filter.GetChunk(out ps) == (int) (IFilterReturnCodes.S_OK))
                {
                    if (ps.flags == CHUNKSTATE.CHUNK_TEXT)
                    {
                        IFilterReturnCodes scode = 0;
                        while (scode == IFilterReturnCodes.S_OK || scode == IFilterReturnCodes.FILTER_S_LAST_TEXT)
                        {
                            uint pcwcBuffer = 65536;
                            System.Text.StringBuilder sbBuffer = new System.Text.StringBuilder((int)pcwcBuffer);

                            scode = (IFilterReturnCodes) filter.GetText(ref pcwcBuffer, sbBuffer);

                            if (pcwcBuffer > 0 && sbBuffer.Length > 0)
                            {
                                if (sbBuffer.Length < pcwcBuffer) // Should never happen, but it happens !
                                    pcwcBuffer = (uint)sbBuffer.Length;

                                sb.Append(sbBuffer.ToString(0, (int) pcwcBuffer));
                                sb.Append(" "); // "\r\n"
                            }

                        }
                    }

                }
            }
            finally
            {
                if (filter != null) {
                    Marshal.ReleaseComObject (filter);
                    System.GC.Collect();
                    System.GC.WaitForPendingFinalizers();
                }
            }

            return sb.ToString();
        }
    }
}

At the moment, this seems like the best way to extract text from documents using the .NET platform on a Windows server. Thanks everybody for your help.

UPDATE - Mar 08 2011

While I still think that ifilters are a good way to go, I think if you are looking to index documents using Lucene from .NET, a very good alternative would be to use Solr. When I first started researching this topic, I had never heard of Solr. So, for those of you who have not either, Solr is a stand-alone search service, written in Java on top of Lucene. The idea is that you can fire up Solr on a firewalled machine, and communicate with it via HTTP from your .NET application. Solr is truly written like a service and can do everything Lucene can do, (including using Tika extract text from .PDF, .XLS, .DOC, .PPT, etc), and then some. Solr seems to have a very active community as well, which is one thing I am not to sure of with regards to Lucene.NET.

.net

asp.net

lucene

solr

lucene.net

asked on Stack Overflow Feb 5, 2011 by

dana • edited Oct 21, 2019 by

JohZant

4 Answers

You can also check out ifilters - there are a number of resources if you do a search for asp.net ifilters:

Of course, there is added hassle if you are distributing this to client systems, because you will either need to include the ifilters with your distribution and install those with your app on their machine, or they will lack the ability to extract text from any files they don't have ifilters for.

answered on Stack Overflow Feb 5, 2011 by

Prescott • edited May 23, 2017 by

Community

This is one of the reasons I was dissatisfied with Lucene for a project I was working on. Xapian is a competing product, and is orders of magnitude faster than Lucene in some cases and has other compelling features (well, they were compelling to me at the time). The big issue? It's written in C++ and you have to interop to it. That's for indexing and retrieval. For the actual parsing of the text, that's where Lucene really falls down -- you have to do it yourself. Xapian has an omega component that manages calling other third party components to extract data. In my limited testing it worked pretty darn well. I did not finish the project (more than POC) but I did write up my experience compiling it for 64 bit. Of course this was almost a year ago, so things might have changed.

If you dig into the Omega documentation you can see the tools that they use to parse documents.

PDF (.pdf) if pdftotext is available (comes with xpdf)

PostScript (.ps, .eps, .ai) if ps2pdf (from ghostscript) and pdftotext (comes with xpdf) are available

OpenOffice/StarOffice documents (.sxc, .stc, .sxd, .std, .sxi, .sti, .sxm, .sxw, .sxg, .stw) if unzip is available

OpenDocument format documents (.odt, .ods, .odp, .odg, .odc, .odf, .odb, .odi, .odm, .ott, .ots, .otp, .otg, .otc, .otf, .oti, .oth) if unzip is available

MS Word documents (.doc, .dot) if antiword is available

MS Excel documents (.xls, .xlb, .xlt) if xls2csv is available (comes with catdoc)

MS Powerpoint documents (.ppt, .pps) if catppt is available, (comes with catdoc)

MS Office 2007 documents (.docx, .dotx, .xlsx, .xlst, .pptx, .potx, .ppsx) if unzip is available

Wordperfect documents (.wpd) if wpd2text is available (comes with libwpd)

MS Works documents (.wps, .wpt) if wps2text is available (comes with libwps)

Compressed AbiWord documents (.zabw) if gzip is available

Rich Text Format documents (.rtf) if unrtf is available

Perl POD documentation (.pl, .pm, .pod) if pod2text is available

TeX DVI files (.dvi) if catdvi is available

DjVu files (.djv, .djvu) if djvutxt is available

XPS files (.xps) if unzip is available

answered on Stack Overflow Feb 5, 2011 by

Sean • edited Jun 20, 2020 by

Community

Apparently you can use Tika from .net (link)

I have not tried this myself.

answered on Stack Overflow Feb 5, 2011 by

Development 4.0

Other angle here is that Lucene indexes are binary compatible between java and .NET. So you could write the index with Tika and read it with C#.

answered on Stack Overflow Mar 8, 2011 by

Wyatt Barnett

User contributions licensed under CC BY-SA 3.0