
Thread: Dealing with UTF-8 in BRAINscript

  1. #1
    Contributor
    Join Date
    Jan 2013
    Location
    Boston, MA
    Posts
    15

    Default Dealing with UTF-8 in BRAINscript

    Hello, everyone.

    The topic of manipulating UTF-8 strings within the nodes has come up with increasing frequency recently. To that end, I’d like to share a few tips with the community.
    1. Before you can manipulate UTF-8 data, you need to make sure those fields are stored as the unicode type, not the string type. You can check this by opening the BRD viewer and looking at the field in question; the data type is shown directly under the field name. If the field is of type "string", it may display properly in the BRD viewer, but it will not work when you attempt to operate on it as a string.
    2. UTF-8 encoded field names are not allowed and will cause a node to fail.
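    The type distinction in point #1 can be illustrated outside of BRAINscript. This is a Python sketch (an analogy, not LAE code): when UTF-8 bytes sit in a single-byte string field, length and slicing operations work on bytes rather than characters, which is why string-typed fields misbehave.

    ```python
    # Three Chinese characters, and the raw UTF-8 bytes a string-typed
    # field would effectively hold.
    text = "只依三"
    raw = text.encode("utf-8")

    print(len(text))  # 3 -- character count, what a unicode field sees
    print(len(raw))   # 9 -- byte count, what a string-typed field sees
    print(raw[:2])    # slicing the bytes splits a character mid-sequence
    ```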

    Let’s say you have completed steps #1 and #2 above and now you want to add a filter node that excludes certain records based on the contents of a column. Normally, you would just write BRAINscript such as:
    Code:
    emit * where columnA <> "my_string"
    If the string happens to be UTF-8 encoded, you’ll have a problem, because you can’t enter it natively in the script. For example:
    Code:
    emit * where columnA <> "只依三"
    This will not work, because UTF-8 isn’t currently supported in the script.

    The solution is to use the Unicode Converter tool in BRE. To do so:
    1. Open BRE
    2. Go to the View menu and select Unicode Converter
    3. Paste the UTF-8 string you want to use in the script (in our example, "只依三") and click Convert
    4. You’ll get output that looks something like "\u53ea\u4f9d\u4e09"; those are the hexadecimal Unicode code points of the characters you entered.
    5. Go back to your filter node and paste that string wherever you would use a normal string (e.g. in functions, variables, etc.)
    Code:
    emit * where columnA <> u"\u53ea\u4f9d\u4e09"
    Note the "u" before the string. It’s needed to indicate that what follows should be treated as Unicode.
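    If you don't have BRE at hand, what the Unicode Converter produces can be reproduced with a few lines of Python. This is a rough equivalent I wrote for illustration (an assumption about the tool's behavior, not its actual code): each character becomes its hexadecimal code point in \uXXXX form.

    ```python
    def to_unicode_escapes(s):
        # Format each character's code point as a \uXXXX escape.
        # (Illustrative helper; characters beyond U+FFFF would need \UXXXXXXXX.)
        return "".join("\\u%04x" % ord(c) for c in s)

    print(to_unicode_escapes("只依三"))  # \u53ea\u4f9d\u4e09
    ```

    The output matches what the converter gives for our example string.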


    Last edited by pdespot; 06-11-2013 at 05:44 PM.

  2. #2

    Default

    Hi!

    How do I correctly display UTF-8 data in the BRD viewer after the Unicode conversion?

    How do I get the Output Excel node to work with UTF-8 data?

    Thanks in advance!

  3. #3

    Default

    Hello, aop.

    The BRD viewer has a separate setting for which character set to use. To change it:
    1) Open the BRD viewer
    2) Go to the Preference menu and select "Set codepages"
    3) That will bring up a dialog box in which you can select your character set
    4) Enter "utf-8" without the quotes and click OK
    5) That should correctly display the UTF-8 characters
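    To see why this setting matters, here is a short Python illustration (an analogy, not viewer code): the same UTF-8 bytes look like mojibake when decoded with a single-byte codepage, and display correctly when decoded as UTF-8. I use latin-1 to stand in for a single-byte codepage; the LAE default, windows-1252, cannot even decode some of these byte values.

    ```python
    # Raw UTF-8 bytes for our example string.
    raw = "只依三".encode("utf-8")

    # Decoded with a single-byte codepage: each byte becomes its own
    # (wrong) character -- this is what a mismatched viewer setting shows.
    print(raw.decode("latin-1"))

    # Decoded as UTF-8: the intended text.
    print(raw.decode("utf-8"))  # 只依三
    ```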

  4. #4

    Default

    It does work for the original UTF-8 data, but after applying the unicode function to the field, it no longer displays correctly with that codepage.

  5. #5

    Default

    Interesting. Could you please post the graph with the node making that change and some sample data?

  6. #6

    Default

    I’ve attached a zip below with the graph and demo data.
    Attached Files

  7. #7

    Default

    When I run the attached graph, it completes and produces the attached output. Note that this is the output pin of the filter node. Also, both Excel files contain the same data. Is that what you're seeing?
    brgout.jpg

  8. #8
    Lavastorm Employee
    Join Date
    Aug 2009
    Location
    Cologne
    Posts
    513

    Default

    Hey,

    The problem here is that the Delimited File node does not handle unicode data.
    It always tries to read the data in the LAE's character set.
    The LAE has a configurable character set, which is used for the string data fields and metadata.
    However, this is required to be a single-byte character set (by default this is windows-1252).

    The Delimited File node doesn't actually do any character set manipulation or character encoding/decoding.
    Therefore, the data that is on the output of the node is actually UTF-8 data, which is why if you change the display preferences in the BRD Viewer, it displays correctly.
    However, since this is in a string field, none of the other nodes are able to handle this data as they try to decode it in the LAE's character set.
    When you try to convert this to unicode, you will run into errors for this reason.
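    The mismatch Tim describes can be sketched in Python (an analogy, not LAE node code): the field holds raw UTF-8 bytes, but downstream nodes interpret them in the LAE's single-byte character set, producing garbage; re-encoding in that single-byte set and decoding as UTF-8 recovers the original text. Here latin-1 stands in for the LAE character set, since windows-1252 cannot round-trip every byte value.

    ```python
    # What the Delimited File node actually passes through: raw UTF-8 bytes.
    raw = "只依三".encode("utf-8")

    # What downstream nodes "see" when they decode those bytes in the
    # LAE's single-byte character set: mojibake.
    garbled = raw.decode("latin-1")

    # Undoing the misinterpretation recovers the original text.
    recovered = garbled.encode("latin-1").decode("utf-8")
    print(recovered)  # 只依三
    ```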

    These limitations with both the Delimited File node and the LAE's character set have been logged in our bug-tracking system and will be addressed in a future release.
    This, however, requires a large development effort to overhaul all of the node infrastructure that expects all string fields to be encoded using a single byte character set, along with the various pieces of the software that are required to process output and input metadata.


    In the meantime, I have put together a node which I think should help you address the problem of acquiring delimited unicode data in the LAE.
    I have posted the node in the thread: http://community.lavastorm.com/threa...g-unicode-data

    Once the data is in a correctly encoded unicode data type, it should work correctly with the Output Excel node.

    Note that this node is not an officially Lavastorm released or supported node, therefore support will only be provided through the forums until this node or something similar can be integrated into the core product.

    Hope this helps,

    Tim.

  9. #9

    Default

    Thanks for the detailed explanation!

    Just to make sure I understood correctly: if I have a data set containing Cyrillic or Chinese characters (and I can encode/convert them to any encoding, such as UTF-8 or UTF-16, during extraction before bringing them into BRE, and I can input them in whatever way is available in BRE, as text, through a DB query, etc.), is there currently no way to connect this input to an Excel output and run it through successfully?

    My current situation is that bringing UTF-8 data into the Excel output fails, while wrapping the fields with the unicode function before the Excel step lets the output run, but the characters don't display correctly.

  10. #10

    Default

    Hi,

    If this is unicode data and there is no single-byte character set that can encode this data, then it will need to be read into the LAE in a unicode type field.
    For instance, if you are only reading cyrillic data, then you could set your server and BRE character set to be windows-1251, and convert the data from UTF-?? to windows-1251 outside of the LAE, then parse this data.
    However, if you need to read cyrillic data and chinese data, then there is no single character set outside of the unicode character sets that can encode these, and there is certainly no single byte character set that can do so.
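    A quick Python check (illustrative, not LAE code) confirms this: a single-byte character set like windows-1251 covers Cyrillic but cannot encode Chinese, which is why mixed data forces a unicode field.

    ```python
    # Cyrillic fits in windows-1251: one byte per character.
    print("привет".encode("cp1251"))

    # Chinese does not fit in any single-byte Cyrillic codepage.
    try:
        "只依三".encode("cp1251")
    except UnicodeEncodeError:
        print("cp1251 cannot encode Chinese text")
    ```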

    Therefore, they need to be acquired into the LAE in a unicode field.
    The post I mentioned previously, here: http://community.lavastorm.com/threa...g-unicode-data contains a node which allows you to acquire such unicode delimited data into the LAE.
    I just updated this, because I forgot to remove some base libraries that weren't necessary, and I had uploaded the node with some debug flags on, which means that it would have appeared to hang.
    If you try that now, you should be able to use it to get the unicode data into the LAE, and then using the Output Excel node, you should get the correct output spreadsheet.

    Regards,
    Tim.
