Language detection

I'm working with MetaTagger 4.0. Is it possible to detect which language the content is written in using MetaTagger? For example, if the content of a file is written in Spanish, I want MetaTagger to detect that and automatically set the "Language" tag on the file to "Spanish".

Comments

Migrateduser

With MetaTagger 4.0.1, the language is included as part of the native metadata section. If the transconverter is not configured to use a specific language, it identifies the language using a statistical signature.

Regards,
Clark

Migrateduser

Clark, thanks for your suggestion! I found this feature of the transconverter very handy. It meets my needs for the most part, except when I have document files that contain English and the Russian translation of the same content side by side on the same page. In this case, the transconverter identifies the language as English only. It would be nice if it could identify both languages. Will this be possible in future MetaTagger releases?

Migrateduser

Can't comment on future roadmap, but there is an approach you may want to take:

A simple CLT preprocessor, written in the language of your choice (Perl, Java, ...), would be able to scan the input text and remove content in specific Unicode character ranges. Since Russian and Latin-1 languages use different code points, you'd be able to strip out all the Russian or non-Russian characters from the input text.
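A minimal sketch of that kind of range-based stripping, written in Python for brevity (the same approach carries over to Perl or Java). The function names and the exact character ranges are assumptions chosen for illustration and are not part of MetaTagger; how the script gets wired in as a preprocessor is not shown here.

```python
import re

# Assumption: "Russian" text is covered by the Cyrillic block (U+0400-U+04FF)
# plus the Cyrillic Supplement block (U+0500-U+052F).
CYRILLIC = re.compile(r"[\u0400-\u052F]+")

# Assumption: "Latin-1" text means ASCII letters plus the accented letters of
# the Latin-1 Supplement (U+00C0-U+00FF, skipping the math signs U+00D7, U+00F7).
LATIN = re.compile(r"[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+")


def _squeeze(text):
    """Collapse runs of spaces/tabs left behind by the removal and trim the ends."""
    return re.sub(r"[ \t]+", " ", text).strip()


def strip_cyrillic(text):
    """Remove Cyrillic letters; Latin letters, digits and punctuation remain."""
    return _squeeze(CYRILLIC.sub("", text))


def strip_latin(text):
    """Remove Latin letters; Cyrillic letters, digits and punctuation remain."""
    return _squeeze(LATIN.sub("", text))


if __name__ == "__main__":
    sample = "Hello Привет world мир"
    print(strip_cyrillic(sample))
    print(strip_latin(sample))
```

Digits and punctuation are shared between both scripts, so they are left in place in either output.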

Regards,
Clark

Migrateduser

Clark, I do not quite get what you are suggesting. I can potentially have a file with content written in mixed languages, for example English and Russian. For this kind of file, I'd like MetaTagger to set the language tag to "English, Russian".

If I write a custom content processor, it will run after the transconverter step. If I use the default transconverter, it will already have added a <language>english</language> string to the metadata. Are you suggesting that I write my own custom transconverter that strips out all English characters and then detects which language the remaining characters belong to?
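To make the "see what is left" idea concrete, here is a rough Python sketch that reports which scripts make up a meaningful share of the letters in a text. It is only an assumption about how such a check could look: the function names and the 20% threshold are hypothetical, nothing here is a MetaTagger API, and a script is not a language (Latin letters could be English, Spanish, and so on), so each portion would still need real language identification.

```python
import unicodedata


def script_shares(text):
    """Return the fraction of Latin and Cyrillic letters among all counted letters."""
    counts = {"Latin": 0, "Cyrillic": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation and whitespace
        # Unicode character names start with the script, e.g. "CYRILLIC SMALL LETTER A".
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            counts["Latin"] += 1
        elif name.startswith("CYRILLIC"):
            counts["Cyrillic"] += 1
    total = sum(counts.values())
    return {script: (n / total if total else 0.0) for script, n in counts.items()}


def scripts_present(text, min_share=0.2):
    """List the scripts whose share of letters is at least min_share (hypothetical cutoff)."""
    return [s for s, share in script_shares(text).items() if share >= min_share]


if __name__ == "__main__":
    print(scripts_present("Hello world Привет мир"))
    print(scripts_present("Hello world"))
```

The threshold keeps a stray foreign word or name from marking a document as bilingual; the right value would depend on the documents involved.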

Migrateduser

The rack processors are language-sensitive (word splitting, phrase finding, stemming, etc), so it's best to only have one named language going into those phases. Beyond that, your understanding is correct.

Migrateduser

Thanks, Clark. So if I need to write my own transconverter, I have a problem: I can't find any MetaTagger documentation that explains the most basic things:

1. First of all, if I put a piece of Perl code in the spot for the custom transconverter script, what will it get as input from MetaTagger? Should it write to STDOUT?
2. My code needs access to the cracked text so that it can strip out the English characters. Is there any MetaTagger API or CLT that I can use to crack the file?
3. My code then feeds the processed text to something that can identify the language. Again, is there any API or CLT available to help with this?

For #2 and #3 above, maybe I can invoke iwmttranscoverter twice? If so, what parameters should I pass to it?

Thanks again.