JavaObjects Apache Http Client and getting UTF-8 data from response

OK,

This one is a bit of a noodle scratcher. I suspect the root of the problem is that when you pull string data into an Oscript string it doesn't interpret the character set correctly.

What I'm trying to do is consume a REST call using the Apache HTTP client (a Java library) from within Oscript. With JSON responses it works fine. My Oscript looks like this:

String body = .Handler().InvokeMethod( 'handleResponse', { response } )

The Handler() function gets me an instance of

org.apache.http.impl.client.BasicResponseHandler

and the Response object is the result from executing

org.apache.http.client.methods.HttpPost

What I noticed, when I cast the body Oscript string to a byte array, is that any UTF-8 encoded characters in the response are treated as literal single bytes, so if I had, say, _Table des matières_ it gets interpreted as _Table des matiÃ¨res_

In the Oscript string, the two-byte UTF-8 encoding of the è gets interpreted as the literal ANSI string Ã¨

I tried saving this content to file using both ASCII mode and BIN mode. They give me the same result: the above sequence gets written out as 4 bytes (c3 83 c2 a8) rather than the two UTF-8 bytes (c3 a8) I was expecting.
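For anyone wanting to reproduce the byte pattern outside Content Server, here is a minimal Java sketch. It assumes the UTF-8 bytes were decoded with a single-byte charset (ISO-8859-1 is used here as a stand-in for "ANSI") and then written back out as UTF-8. The class and helper names are hypothetical and only there for illustration.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Hypothetical helper: format bytes as space-separated lowercase hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Decode UTF-8 bytes with the wrong (single-byte) charset,
    // then encode the resulting string as UTF-8 again.
    static byte[] doubleEncode(byte[] utf8Bytes) {
        String misread = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u00e8" is the letter e with a grave accent.
        byte[] original = "\u00e8".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(original));               // c3 a8
        System.out.println(toHex(doubleEncode(original))); // c3 83 c2 a8
    }
}
```

Each byte of the original two-byte sequence is read as its own character, and each of those characters then needs two bytes in UTF-8, which matches the four bytes seen in the saved file.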

I suspect I've hit upon an Oscript limitation with the handling of JavaObjects and bringing Java strings over to UTF-8. Is there another approach I should consider? I think there may be a way in Java to directly stream the response entity (when the entire response is in fact a file to be downloaded), but I was hoping to only use available methods in the Apache HTTP client library which already ships with Content Server - I'm using CS 22.1 BTW.

-Hugh

Comments

  • In case anyone ever comes back to this, I found the solution. It looks like the Apache Client API will give you back your string either in the default char set or UTF-8 (which may or may not be the default) depending on whether the response explicitly sets UTF-8 in one of the headers. In the case of the REST services I was querying in Oscript, one of them had explicitly set the char set to UTF-8 in the header, the other, the one where the entire response body was the file I wanted to save, did not.

    The Apache HTTP client library gives you two ways to get your response body. The typical way is to call a response handler, usually the BasicResponseHandler's handleResponse() method. It takes no arguments except for the response you are trying to get the body from.

    The other way is to use JavaObject's invokeStaticMethod() method to call the EntityUtils class's toString() method. This method takes one mandatory and one optional parameter. The mandatory (first) parameter is the HttpEntity from the response, which you can fetch using response.getEntity(). The optional second parameter is a Charset object, or a String naming a charset, in this case simply "UTF-8"; it is the charset applied when the response does not declare one itself. If you pass it, the Oscript string you get back from EntityUtils.toString() is decoded properly as UTF-8. Otherwise, as happened in my case, the body was decoded with the library's single-byte default (ISO-8859-1 as far as I can tell, rather than US-ASCII as I first assumed), so each byte of a UTF-8 sequence became its own character and was then re-encoded as UTF-8, which is why my 2-byte sequences turned into 4-byte sequences.
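The effect of that second parameter can be sketched in plain Java without the Apache library. This is not the library's implementation, just an illustration of the idea: use the charset declared in the Content-Type header if there is one, otherwise fall back to a caller-supplied default. The class, the helper names, and the sample header values are all hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetFallbackDemo {
    // Hypothetical helper: return the charset named in a Content-Type value,
    // or the fallback when none is declared (or the name is not usable).
    static Charset charsetFor(String contentType, Charset fallback) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Unknown or malformed charset name: use the fallback.
                    }
                }
            }
        }
        return fallback;
    }

    // Decode a response body using the declared charset or the fallback.
    static String decode(byte[] body, String contentType, Charset fallback) {
        return new String(body, charsetFor(contentType, fallback));
    }

    public static void main(String[] args) {
        byte[] body = "Table des mati\u00e8res".getBytes(StandardCharsets.UTF_8);
        // No charset declared, single-byte fallback: the accent is garbled.
        System.out.println(decode(body, "application/octet-stream", StandardCharsets.ISO_8859_1));
        // No charset declared, UTF-8 fallback: decoded correctly.
        System.out.println(decode(body, "application/octet-stream", StandardCharsets.UTF_8));
        // Charset declared in the header: the fallback is ignored.
        System.out.println(decode(body, "application/json; charset=UTF-8", StandardCharsets.ISO_8859_1));
    }
}
```

This mirrors the two REST services described above: the one that declared UTF-8 in its header decoded correctly either way, while the file download, which declared nothing, only decoded correctly once UTF-8 was supplied as the default.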

    Ulterior motive for posting: Half the time I post a solution to my problems, I end up being the one to use it 2 years later 😁

    -Hugh