TS 16.4.1: Can I turn off Tika XML Parsing in SOLR?

I just got TS 16.4.1 up and running, with the Indexer working against SOLR and Zookeeper. While indexing our content, I tailed the index logs and noticed a lot of XML parsing errors scroll by. A closer look revealed that the indexer was parsing our HTML fragments and declaring the "XML" invalid, even though we are producing exactly the HTML that the business wants for our website fragments. I don't want or need Tika/SOLR telling me our HTML is malformed according to its XML parser. I want to turn this off because I believe that when a file errors out due to XML parsing issues, its content does not get indexed. I'm not entirely sure about that, but I tested it with one file that failed, and Search would not return that file when I searched for keywords that appear in it. The filename comes back in a filename search, but none of the contents do.

Anyway, does anyone know where this is configured and how I can turn that feature off?
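
For anyone who wants to reproduce the failure mode outside TeamSite, here is a minimal Python sketch. It is not TeamSite, Tika, or SOLR code, and it says nothing about how the Indexer is actually configured. The fragment and the function names (`parse_as_xml`, `extract_text`) are hypothetical. It only illustrates the general point: a strict XML parser rejects an ordinary HTML fragment, while a lenient HTML parser can still pull out the text that a search index would need.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical HTML fragment: an unclosed <br> and two top-level elements.
# Perfectly usable as a website fragment, but not well-formed XML.
FRAGMENT = '<div class="promo">Spring sale<br>ends Friday</div>\n<p>Contact us</p>'


def parse_as_xml(markup):
    """Return True if a strict XML parser accepts the markup, else False."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Lenient HTML parser that keeps only non-blank text nodes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(markup):
    """Extract searchable text without requiring well-formed markup."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join(collector.chunks)


if __name__ == "__main__":
    print("Accepted as XML:", parse_as_xml(FRAGMENT))
    print("Extracted text:", extract_text(FRAGMENT))
```

In this sketch the strict parse fails on the unclosed `<br>`, yet the text is still recoverable, which is the behavior I would expect from an indexer.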

Comments

  • David Smith
    edited March 12, 2019 #2

    I confirmed that I can still search for content in a file that was flagged by one of these errors - I found another file that had clearer content outside of its HTML attributes to search on and it returned the correct result. However, I still want to turn off the Tika Parsing.

    UPDATE: I lied - I was searching against the wrong file. When I attempted to search for keywords/phrases in the file that the parser flagged as malformed, I could not get search to return that file.

  • Hi there, this is a bit too technical for me, so just checking in first - did you resolve this with your update to the second post? BTW, thanks for upgrading to the latest version of TeamSite!

    Jacqui Newell
    Sr Product Marketing Manager
    OpenText

  • Hi Jacqui. No, I did not resolve this. It’s a real concern for us as far as continuing with TeamSite. I received confirmation from our Support Engineer that if the Tika XML Parser fails a file because it thinks the file is malformed, the contents of that file will not be indexed. I had a few hundred files that failed the Parser. However, those files are formatted exactly how we want them. We can’t be expected to change our content to please the Parser.

    I’m told there is no way to turn the Parser off, which is very disappointing. It’s not the Search Indexer’s job to parse XML. Just index the files. I’m awaiting further analysis from Support, but this could be a showstopper for us.
  • Yeah, you are pretty much correct. The full text search is pretty much useless as implemented.

    I certainly don't get it. The previous search was marginal at best, but it worked better than this implementation.

  • Agreed - I didn't think I'd ever say "IDOL was better than this" but it's true at this point. I am getting all kinds of strange results when I attempt to search on Extended Attributes as well. It doesn't seem to be integrated properly or something.

  • I will rant some more on the use of this silly parser during Indexing. We utilize Server Side Includes in our HTML fragments that we produce from many of our templates via PTs. The parser doesn't like Server Side Includes. We're certainly not going to change THAT! How can OpenText expect their customers to create content to appease their Indexer because it is parsing the content prior to Indexing? That's utterly ridiculous. I'll repeat my earlier comment: It is not the Indexer's job to parse my content. Leave that to me. The Indexer just needs to index my content.

  • First, thank you for bringing this to our attention! I will forward all of your comments to the technical experts and see what I can find out for you.

    Jacqui Newell
    Sr Product Marketing Manager
    OpenText

  • @David Smith said:
    Agreed - I didn't think I'd ever say "IDOL was better than this" but it's true at this point. I am getting all kinds of strange results when I attempt to search on Extended Attributes as well. It doesn't seem to be integrated properly or something.

    And the sad thing is that the parser really has little to do with IDOL/SOLR. There are so many general-purpose string parsers out there. That being said, IDOL had KeyView, which was a performance hog but functioned pretty well. Tika is well known and can supposedly detect and extract metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

    So there is no reason that we should have issues except that Tika is not being used properly.

  • @Jacqui_N said:
    First, thank you for bringing this to our attention! I will forward all of your comments to the technical experts and see what I can find out for you.

    I certainly hope none of this is a surprise, as both Smitty and I have been working with support on issues like this for months now.

  • I'm the marketer, so I don't get technical inquiries, but I'm always willing to help you find an answer. I just found out about this forum yesterday! I am sure it's not a surprise to those "in the know". I reached out to engineering earlier this morning...just waiting on a response.

    Jacqui Newell
    Sr Product Marketing Manager
    OpenText

  • Hi there! I heard back from the engineer and he said, "Support is the right channel they should approach for technical assistance." I saw that you have been working with support on this. Is there anything further I can do to help?

    Jacqui Newell
    Sr Product Marketing Manager
    OpenText

  • nipper
    edited March 14, 2019 #13

    Well, reportedly there is a patch (Smitty received it). We shall see if it works.

  • I will bring this forward!

    Jacqui Newell
    Sr Product Marketing Manager
    OpenText

  • I received a patch this morning and am currently re-indexing my branch to find out if it works. Fingers crossed.

  • Are there emojis on this forum? I want to show the fingers crossed emoji....picture it, if you will. :)

    Jacqui Newell
    Sr Product Marketing Manager
    OpenText

  • There aren't many formatting tools or emojis on this forum, Jacqui. I appreciate your help!

  • The patch I received appears to be working well. I'm a much happier camper.

  • Fabulous! Glad to hear it!!

    Jacqui Newell
    Sr Product Marketing Manager
    OpenText
