Using the Tika Java Library In Your .Net Application With IKVM

This may sound scary and heretical but did you know it is possible to leverage Java libraries from .Net applications with no TCP sockets or web services getting caught in the crossfire? Let me introduce you to IKVM, which is frankly magic:

IKVM.NET is an implementation of Java for Mono and the Microsoft .NET Framework. It includes the following components:

  • A Java Virtual Machine implemented in .NET
  • A .NET implementation of the Java class libraries
  • Tools that enable Java and .NET interoperability

Using IKVM we have been able to successfully integrate our Dovetail Seeker search application with the Tika text extraction library, which is implemented in Java. With Tika we can easily pull text out of rich documents in many supported formats. Why Tika? Because there is nothing comparable to Tika in the .Net world.

This post will review how we integrated with Tika. If you like code you can find this example in a repo up on Github.

Compiling a Jar Into An Assembly

First thing, we need to get our hands on the latest version of Tika. I downloaded and built the Tika source using Maven as instructed. The result of this was a few jar files. The one we are interested in is tika-app-x.x.jar which has everything we need bundled into one useful container.

Next up we need to convert this jar we’ve built to a .Net assembly. Do this using ikvmc.exe.

tika\build>ikvmc.exe -target:library tika-app-0.7.jar

Unfortunately, you will see tons of troublesome-looking warnings, but the end result is a .Net assembly wrapping the Java jar which you can reference in your projects.

Using Tika From .Net

IKVM is pretty transparent. You simply reference the Tika app assembly and your .Net code is talking to Java types. It is a bit weird at first, as you have both Java versions and .Net versions of types. Next you’ll want to make sure that all the dependent IKVM runtime assemblies are included with your project. Using Reflector I found that the Tika app assembly referenced a lot of IKVM assemblies which do not appear to be used. I had to figure out through trial and error which assemblies were not being touched by the rich document extractions being done. If need be you could simply include all of the referenced IKVM assemblies with your application. Below I have done the work for you and eliminated references to all the IKVM assemblies which do not appear to be in play.

image

16 assemblies down to 5. A much smaller deployment.
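If you would rather start from a list than from Reflector, the referenced assemblies can also be read with plain reflection. This is only a minimal sketch: the class and method names are made up for illustration, and it reports the references compiled into the assembly, not which of them are actually touched at runtime, so the trial and error described above is still needed.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical helper (not part of the example repo).
public static class IkvmReferenceLister
{
    // Returns the simple names of all referenced assemblies whose name starts with "IKVM.".
    public static string[] ListIkvmReferences(Assembly assembly)
    {
        return assembly.GetReferencedAssemblies()
            .Select(reference => reference.Name)
            .Where(name => name != null && name.StartsWith("IKVM.", StringComparison.OrdinalIgnoreCase))
            .OrderBy(name => name, StringComparer.OrdinalIgnoreCase)
            .ToArray();
    }
}
```

Assuming the converted assembly sits next to your executable, something like IkvmReferenceLister.ListIkvmReferences(Assembly.LoadFrom("tika-app-0.7.dll")) would give you the starting list to whittle down.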

Using Tika

To do some text extraction we’ll ask Tika, very nicely, to parse the files we throw at it. For my purposes this involved having Tika automatically determine how to parse the stream and extract the text and metadata about the document.

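public TextExtractionResult Extract(string filePath)
{
    // AutoDetectParser determines the document type and picks a suitable parser.
    var parser = new AutoDetectParser();
    var metadata = new Metadata();
    var parseContext = new ParseContext();
    // Register the parser in the parse context under its own class.
    java.lang.Class parserClass = parser.GetType();
    parseContext.set(parserClass, parser);

    try
    {
        // These are the Java File and URL types, not their .Net counterparts.
        var file = new File(filePath);
        var url = file.toURI().toURL();
        using (var inputStream = MetadataHelper.getInputStream(url, metadata))
        {
            // The handler from getTransformerHandler() (not shown) receives the parsed
            // content; the text is read back from _outputWriter below.
            parser.parse(inputStream, getTransformerHandler(), metadata, parseContext);
            inputStream.close();
        }

        return assembleExtractionResult(_outputWriter.toString(), metadata);
    }
    catch (Exception ex)
    {
        // Wrap any failure, Java or .Net, with the name of the offending file.
        throw new ApplicationException(string.Format("Extraction of text from the file '{0}' failed.", filePath), ex);
    }
}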

One Important Caveat

Java has a concept called a ClassLoader, which is responsible for finding and loading Java types at runtime. There is probably a better way around this, but for some reason you have to implement a custom ClassLoader and also set an application setting telling the IKVM runtime which .Net type to use as the ClassLoader.

    public class MySystemClassLoader : ClassLoader
    {
        public MySystemClassLoader(ClassLoader parent)
            : base(new AppDomainAssemblyClassLoader(typeof(MySystemClassLoader).Assembly))
        {
        }
    }

Here is an example app.config telling IKVM where the ClassLoader is found.

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
    <appSettings>
        <add key="ikvm:java.system.class.loader" value="TikaOnDotNet.MySystemClassLoader, TikaOnDotNet" />
    </appSettings>
</configuration>

This step is very important. If, for some horrible reason, IKVM cannot find the class loader, Tika will appear to work fine but will extract only empty documents with no metadata. The main reason this is troubling is that no exception is raised. For this reason we actually have a validation step in our application ensuring that the app setting is present and that it resolves to a valid type.
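A minimal sketch of what such a validation step could look like follows. It is not the code from our application: the class and method names are invented for illustration, and it only checks that the setting exists and resolves to some loadable type, not that the type actually derives from ClassLoader.

```csharp
using System;
using System.Collections.Specialized;

// Hypothetical validator (names are illustrative, not taken from the example repo).
public static class ClassLoaderSettingValidator
{
    public const string SettingKey = "ikvm:java.system.class.loader";

    // Returns null when the setting looks valid, otherwise a description of the problem.
    public static string Validate(NameValueCollection appSettings)
    {
        var typeName = appSettings[SettingKey];
        if (string.IsNullOrEmpty(typeName) || typeName.Trim().Length == 0)
        {
            return "The app setting '" + SettingKey + "' is missing or empty.";
        }

        // With throwOnError set to false, Type.GetType returns null for a type it cannot find.
        var type = Type.GetType(typeName, false);
        if (type == null)
        {
            return "The app setting '" + SettingKey + "' does not resolve to a type: " + typeName;
        }

        return null;
    }
}
```

At startup you could call ClassLoaderSettingValidator.Validate(ConfigurationManager.AppSettings) and refuse to run extractions when it returns a message, which turns the silent empty-document failure into a loud one.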

Demo

Here is a test demonstrating an extraction and the result.

    [Test]
    public void should_extract_from_pdf()
    {
        var textExtractionResult = new TextExtractor().Extract("Tika.pdf");

        textExtractionResult.Text.ShouldContain("pack of pickled almonds");

        Console.WriteLine(textExtractionResult);
    }

Put simply, rich documents like this go in:

Test PDF

And a TextExtractionResult comes out:

public class TextExtractionResult
{
    public string Text { get; set; }
    public string ContentType { get; set; }
    public IDictionary<string, string> Metadata { get; set; } 
    //toString() override
}

Here is the raw output from Tika:

image

Conclusion

I hope this helps boost your confidence that you can use Java libraries in your .Net code and I hope my example repo will be of assistance if you need to do some work with Tika on the .Net platform. Enjoy.

Posted: Friday, July 02, 2010 3:16 PM by kmiller

Comments

Jim said:

Hi,

Thanks for posting this article and the code.  Something I have been very keen to look into.

When I use the TikaOnDotNet project from a WPF 4 application and call:

var textExtractionResult = new TextExtractor().Extract(@"c:\Tika.pdf");

I get

Message=Could not load file or assembly 'IKVM.OpenJDK.Media, Version=0.42.0.6, Culture=neutral, PublicKeyToken=13235d27fcbfff58' or one of its dependencies. The system cannot find the file specified.

      Source=tika-app-0.7

      FileName=IKVM.OpenJDK.Media, Version=0.42.0.6, Culture=neutral, PublicKeyToken=13235d27fcbfff58

Does it need IKVM.OpenJDK.Media?

Or am I doing something wrong?

best regards

# July 7, 2010 9:16 AM

kmiller said:

Jim,

Very glad you are taking a look and trying out the code. I noticed similar behavior yesterday on my main project. Clearly, I didn't manage all dependencies correctly.

You are basically seeing the result of missing IKVM assemblies. You can correct it by downloading IKVM and adding the missing assemblies being complained about.

IKVM distro I used: http://sourceforge.net/projects/ikvm/files/ikvm/0.42.0.6/ikvmbin-0.42.0.6.zip/download

I'll try to update the repo with what's missing. If you create a ticket that would be handy too.

# July 7, 2010 10:28 AM

ChandraBabu said:

Hi,

Can you please upload .Net sample source code which uses the Tika library? I am new to Java and don't know much about converting the Tika jar to a .Net assembly.

Thanks & Regards,

Chandra Babu

# July 9, 2010 12:11 AM

kmiller said:

@Chandra

As I mention in the post, I uploaded the code to Github. That includes the Tika assembly already converted to .Net, which I created using IKVM. Enjoy.

http://github.com/KevM/tikaondotnet

# July 9, 2010 8:49 AM

Biztalk Musings said:

Using TIKA into .NET

# July 14, 2010 12:40 PM

Andy said:

Just needed to leave a comment saying excellent article, thank you for taking the time to write it and make the code available.

# July 21, 2010 9:07 AM

Deepak said:

i am not able to find reference of Microsoft.CSharp ...

# July 22, 2010 4:06 AM

kmiller said:

@Deepak it is currently a .Net 4 project on VS 2010. No reason it has to be so I downgraded it to .Net 3.5. Do a pull from master.

# July 22, 2010 9:54 AM

Andy said:

You can get round having to have an app.config file by doing:

public TextExtractor()
{
    ConfigurationManager.AppSettings.Set("ikvm:java.system.class.loader", "TikaOnDotNet.MySystemClassLoader, TikaOnDotNet");
}

# July 23, 2010 4:17 AM

kmiller said:

@Andy thank you for the suggestion. That would indeed avoid having users accidentally blow away the config. Thank you for checking out the repo.

# July 23, 2010 7:28 AM

Kevin Miller said:

One of our products is currently using IKVM to interop with a java library . I kept seeing console warnings

# August 12, 2010 2:52 PM