TS 16.4.1: Can I turn off Tika XML Parsing in SOLR?

David Smith · March 12, 2019

I just got TS 16.4.1 up and running, with the Indexer working with SOLR and Zookeeper. While indexing our content, I noticed errors scrolling by as I tailed the index logs, many of them XML parsing errors. A closer look revealed that the parser was reading our HTML fragments and declaring the "XML" invalid, even though we are producing exactly the HTML the business wants for our website fragments. I don't want or need the Tika/SOLR XML parser telling me our HTML is malformed. I want to turn this off because I believe that when a file errors out on XML parsing, its content does not get indexed. I'm not entirely sure about that, but I tested it with one file that failed: I could not get Search to return that file when searching for keywords that appear in it. The filename comes back in a filename search, but none of the contents do.

Anyway, does anyone know where this is configured and how I can turn that feature off?
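
To illustrate the kind of mismatch described above, here is a minimal sketch using only the Python standard library. It is not Tika, SOLR, or the TeamSite Indexer, and it makes no claim about how they are configured. The sample fragment and the helper names (`is_well_formed_xml`, `extract_text`, `TextExtractor`) are hypothetical. It only shows that a strict XML parser can reject an HTML fragment that a lenient HTML parser reads without complaint.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text content of HTML without requiring well-formed XML."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-empty text nodes.
        text = data.strip()
        if text:
            self.chunks.append(text)


def is_well_formed_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def extract_text(markup):
    """Return the visible text of an HTML fragment, tolerating loose markup."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical fragment: an unclosed <br>, an unquoted attribute value,
# and two top-level elements. All are common in HTML fragments, and all
# break strict XML parsing.
FRAGMENT = '<p>Store hours<br>Mon-Fri</p>\n<div class=promo>Spring sale</div>'

if __name__ == "__main__":
    print("Well-formed XML?", is_well_formed_xml(FRAGMENT))
    print("Extracted text:", extract_text(FRAGMENT))
```

In this sketch the strict check fails while the lenient extractor still recovers the searchable text, which is the behavior I would expect from an indexer.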

David Smith · March 12, 2019

I confirmed that I can still search for content in a file that was flagged by one of these errors - I found another file that had clearer content outside of its HTML attributes to search on and it returned the correct result. However, I still want to turn off the Tika Parsing.

UPDATE: I lied - I was searching against the wrong file. When I attempted to search for keywords/phrases in the file that the parser flagged as malformed, I could not get search to return that file.

Jacqui_N · March 12, 2019

Hi there, this is a bit too technical for me, so just checking in first - did you resolve this with your update to the second post? BTW, thanks for upgrading to the latest version of TeamSite!

David Smith · March 13, 2019

Hi Jacqui. No, I did not resolve this. It’s a real concern for us as far as continuing with TeamSite. I received confirmation from our Support Engineer that if the Tika XML Parser fails a file because it thinks the file is malformed, the contents of that file will not be indexed. I had a few hundred files that failed the Parser. However, those files are formatted exactly how we want them. We can’t be expected to change our content to please the Parser.

I’m told there is no way to turn the Parser off, which is very disappointing. It’s not the Search Indexer’s job to parse XML. Just index the files. I’m awaiting further analysis from Support, but this could be a showstopper for us.

nipper · March 13, 2019

Yea you are pretty much correct. The full text search is pretty much useless as implemented.

I certainly don't get it. The previous search was marginal at best, but it worked better than this implementation.

David Smith · March 13, 2019

Agreed - I didn't think I'd ever say "IDOL was better than this" but it's true at this point. I am getting all kinds of strange results when I attempt to search on Extended Attributes as well. It doesn't seem to be integrated properly or something.

David Smith · March 13, 2019

I will rant some more on the use of this silly parser during Indexing. We utilize Server Side Includes in our HTML fragments that we produce from many of our templates via PTs. The parser doesn't like Server Side Includes. We're certainly not going to change THAT! How can OpenText expect their customers to create content to appease their Indexer because it is parsing the content prior to Indexing? That's utterly ridiculous. I'll repeat my earlier comment: It is not the Indexer's job to parse my content. Leave that to me. The Indexer just needs to index my content.

Jacqui_N · March 13, 2019

First, thank you for bringing this to our attention! I will forward all of your comments to the technical experts and see what I can find out for you.

nipper · March 13, 2019

@David Smith said:
Agreed - I didn't think I'd ever say "IDOL was better than this" but it's true at this point. I am getting all kinds of strange results when I attempt to search on Extended Attributes as well. It doesn't seem to be integrated properly or something.

And the sad thing is that the parser really has little to do with Idol/SOLR. There are so many general-purpose string parsers out there. That being said, Idol had KeyView, which was a performance hog but functioned pretty well. Tika is well known and supposedly can detect and extract metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

So there is no reason that we should have issues except that Tika is not being used properly.

nipper · March 13, 2019

@Jacqui_N said:
First, thank you for bringing this to our attention! I will forward all of your comments to the technical experts and see what I can find out for you.

I certainly hope none of this is a surprise, as both Smitty and I have been working with support on issues like this for months now.

Jacqui_N · March 13, 2019

I'm the marketer, so I don't get technical inquiries, but I'm always willing to help you find an answer. I just found out about this forum yesterday! I am sure it's not a surprise to those "in the know". I reached out to engineering earlier this morning...just waiting on a response.

Jacqui_N · March 14, 2019

Hi there! I heard back from the engineer and he said, "Support is the right channel they should approach for technical assistance." I saw that you have been working with support on this. Is there anything further I can do to help?

nipper · March 14, 2019

Well, reportedly there is a patch (Smitty received it). We shall see if it works.

Jacqui_N · March 14, 2019

I will bring this forward!

David Smith · March 14, 2019

I received a patch this morning and am currently re-indexing my branch to find out if it works. Fingers crossed.

Jacqui_N · March 14, 2019

Are there emojis on this forum? I want to show the fingers crossed emoji....picture it, if you will.

David Smith · March 14, 2019

There aren't many formatting tools or emojis on this forum, Jacqui. I appreciate your help!

David Smith · March 14, 2019

The patch I received appears to be working well. I'm a much happier camper.

Jacqui_N · March 14, 2019

Fabulous! Glad to hear it!!
