TS 16.4.1: Can I turn off Tika XML Parsing in SOLR?
I just got TS 16.4.1 up and running, with the Indexer working against SOLR and Zookeeper. While indexing our content, I tailed the index logs and noticed a lot of XML parsing errors scroll by. A closer look revealed that the parser was processing our HTML fragments and declaring the "XML" invalid, even though we are producing exactly the HTML that the business wants for our website fragments. I don't want or need Tika/SOLR telling me our HTML is malformed according to its XML parser. I want to turn this off because I believe that when a file errors out due to XML parsing issues, its content does not get indexed. I'm not entirely sure about that, but I tested with one file that failed, and Search would not return it when I searched for keywords that appear in the file. The filename comes back in a filename search, but none of the contents do.
Anyway, does anyone know where this is configured and how I can turn that feature off?
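For anyone trying to reproduce the symptom outside TeamSite, here is a minimal sketch of the general mismatch between a strict XML parser and an HTML fragment. It uses only the Python standard library, not Tika or SOLR, so it only illustrates the class of error and says nothing about how the Indexer is actually configured. The fragment and the helper names (`parses_as_xml`, `extract_text`) are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical fragment: valid HTML, but not well-formed XML
# (the void <br> tag is never closed, and there is no single root element).
FRAGMENT = """<p>Quarterly results<br>are available now.</p>
<p>Contact us for details.</p>"""


def parses_as_xml(text):
    """Return True only if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TextExtractor(HTMLParser):
    """Lenient extractor: collects text nodes and tolerates loose markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-blank text between tags.
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(text):
    """Return the visible text of an HTML fragment as one string."""
    parser = TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    print("Well-formed XML:", parses_as_xml(FRAGMENT))
    print("Text a lenient parser would index:", extract_text(FRAGMENT))
```

The point of the sketch is only that the same fragment can be rejected by an XML parser while still yielding perfectly indexable text through an HTML-tolerant one.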
Comments
-
I confirmed that I can still search for content in a file that was flagged by one of these errors - I found another file that had clearer content outside of its HTML attributes to search on and it returned the correct result. However, I still want to turn off the Tika Parsing.
UPDATE: I lied - I was searching against the wrong file. When I attempted to search for keywords/phrases in the file that the parser flagged as malformed, I could not get search to return that file.
0 -
Hi Jacqui. No, I did not resolve this. It’s a real concern for us as far as continuing with TeamSite. I received confirmation from our Support Engineer that if the Tika XML Parser fails a file because it thinks it’s malformed, the contents of the file will not be indexed. I had a few hundred files that failed the Parser. However, those files are formatted exactly how we want them. We can’t be expected to change our content to please the Parser.
I’m told there is no way to turn the Parser off, which is very disappointing. It’s not the Search Indexer’s job to parse XML. Just index the files. I’m awaiting further analysis from Support, but this could be a showstopper for us.
0 -
Agreed - I didn't think I'd ever say "IDOL was better than this" but it's true at this point. I am getting all kinds of strange results when I attempt to search on Extended Attributes as well. It doesn't seem to be integrated properly or something.
1 -
I will rant some more on the use of this silly parser during Indexing. We utilize Server Side Includes in our HTML fragments that we produce from many of our templates via PTs. The parser doesn't like Server Side Includes. We're certainly not going to change THAT! How can OpenText expect their customers to create content to appease their Indexer because it is parsing the content prior to Indexing? That's utterly ridiculous. I'll repeat my earlier comment: It is not the Indexer's job to parse my content. Leave that to me. The Indexer just needs to index my content.
0 -
@David Smith said:
Agreed - I didn't think I'd ever say "IDOL was better than this" but it's true at this point. I am getting all kinds of strange results when I attempt to search on Extended Attributes as well. It doesn't seem to be integrated properly or something.
And the sad thing is that the parser really has little to do with IDOL/SOLR. There are so many general-purpose string parsers out there. That being said, IDOL had KeyView, which was a performance hog but functioned pretty well. Tika is well known and can supposedly detect and extract metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
So there is no reason that we should have issues, except that Tika is not being used properly.
0 -
@Jacqui_N said:
First, thank you for bringing this to our attention! I will forward all of your comments to the technical experts and see what I can find out for you.
I certainly hope none of this is a surprise, as both Smitty and I have been working with support on issues like this for months now.
0 -
I'm the marketer, so I don't get technical inquiries, but I'm always willing to help you find an answer. I just found out about this forum yesterday! I am sure it's not a surprise to those "in the know". I reached out to engineering earlier this morning and am just waiting on a response.
Jacqui Newell
Sr Product Marketing Manager
OpenText
0 -
Hi there! I heard back from the engineer and he said, "Support is the right channel they should approach for technical assistance." I saw that you have been working with support on this. Is there anything further I can do to help?
Jacqui Newell
Sr Product Marketing Manager
OpenText
0 -
I received a patch this morning and am currently re-indexing my branch to find out if it works. Fingers crossed.
0 -
There aren't many formatting tools or emojis on this forum, Jacqui. I appreciate your help!
0 -
The patch I received appears to be working well. I'm a much happier camper.
1