
Thread: Character replacement in all fields

  1. #1

    Character replacement in all fields

    I am looking to replace various sets of characters with a specific character, across all fields in a dataset. What I am really trying to accomplish is removing characters with accents, and replacing with the non-accented characters. I also need to keep the case of the character.

    So, for example, I need to replace: À, Á, Â, Ã, Ä, Å with A
    and replace: à, á, â, ã, ä, å with a...etc

    Is there a way to do this?

    Thank you

  2. #2
    Lavastorm Employee gmullin
    Join Date
    May 2014
    Location
    Chicago
    Posts
    159

    If the string is converted to unicode first, toLower() handles the accented characters correctly. Try this in a filter:

    text = " "

    emit text.unicode().toLower() as LowerCaseCharacters

  3. #3
    Lavastorm Employee
    Join Date
    Nov 2012
    Location
    Warrington, UK
    Posts
    226

    Hi,

    The attached data flow has a composite node that replaces specified accented characters with their equivalent 'simplified' characters and maintains the case.

    By default, all of the fields in the input data are processed. Alternatively, you can specify the list of fields to be processed.
    If you specify the fields then any remaining (unspecified) input fields are dropped from the output data.
    All processed fields are output with a unicode data type.

    Replace_Accented_Characters--share.brg

    Note: the node does not support multi-byte unicode characters.

    The character replacements are defined in the 'Process Fields' node within the composite.
    You can modify the two strings '_Acc_Chars' and '_Reg_Chars' as required. They are currently set as:

    _Acc_Chars= unicode("ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïðñòóôõöùúûüýÿ")
    _Reg_Chars= unicode("SZszYAAAAAACEEEEIIIIDNOOOOOUUUUYaaaaaaceeeeiiiidnooooouuuuyy")

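    The mapping is positional: each character in _Acc_Chars is replaced by the character at the same position in _Reg_Chars, so the two strings must stay aligned or the mapping will be incorrect.
    When you add a replacement, add the accented character and its replacement character as a pair, at the same position in both strings.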

    Regards,
    Adrian
    Last edited by awilliams1024; 04-13-2017 at 11:24 AM.
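
    For readers without the .brg file, the positional pairing described in post #3 can be sketched in Python. This is only an illustration of the idea, not the code inside the composite node: the names ACC_CHARS, REG_CHARS, replace_accents and process_record are hypothetical, the character lists are a short subset, and records are assumed to be plain dictionaries.

```python
# Paired strings: the character at position i in ACC_CHARS is replaced by
# the character at position i in REG_CHARS (illustrative subset only).
ACC_CHARS = "ÀÁÂÃÄÅÇÈÉÊËàáâãäåçèéêë"
REG_CHARS = "AAAAAACEEEEaaaaaaceeee"

# Guard against the misalignment the post warns about.
if len(ACC_CHARS) != len(REG_CHARS):
    raise ValueError("ACC_CHARS and REG_CHARS must have the same length")

TABLE = str.maketrans(ACC_CHARS, REG_CHARS)


def replace_accents(value):
    """Replace accented characters in a string; pass other types through."""
    return value.translate(TABLE) if isinstance(value, str) else value


def process_record(record, fields=None):
    """Process every field by default, or only the listed fields.

    When fields are listed, unlisted fields are dropped, mirroring the
    behaviour described for the node.
    """
    names = list(record) if fields is None else fields
    return {name: replace_accents(record[name]) for name in names}


if __name__ == "__main__":
    print(process_record({"name": "Éléa", "city": "Çà"}))
```

    Because the two strings are upper- and lower-case aware, the case of each character is preserved without any extra logic.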

  4. #4
    Lavastorm Employee
    Join Date
    Nov 2012
    Location
    Warrington, UK
    Posts
    226

    Hi,

    Attached is a modified version of the node that performs better with larger data sets, at the expense of being a few seconds slower with small data sets.

    The original node retains the record sequence order for the whole data set. The modified version retains the record sequence order only within 'slices' of (in this case) 100 records, and it interleaves the slices in the output data set. For example, if the data has 1000 records, with records 1 - 100 in the 0xx slice and records 101 - 200 in the 1xx slice, the output slice order will be 0xx, 3xx, 6xx, 9xx, 1xx, 4xx, 7xx, 2xx, 5xx, 8xx.

    Replace_Accented_Characters_v2--share.brg

    Regards,
    Adrian
    Last edited by awilliams1024; 04-14-2017 at 09:16 AM.

  5. #5
    Lavastorm Employee
    Join Date
    Dec 2006
    Location
    Dallas, TX
    Posts
    297

    Attached is a node that does a one-way conversion of Unicode strings to human-readable ASCII. It is intended for reporting purposes only.
    Attached Files
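
    As a rough illustration of what a one-way Unicode-to-ASCII conversion can look like, here is a Python sketch using the standard unicodedata module. It is an assumption about the general technique (decompose, drop the accent marks, replace whatever is left), not a description of how the attached node works; the function name to_ascii is hypothetical.

```python
import unicodedata


def to_ascii(text):
    """Lossy, one-way conversion of a Unicode string to plain ASCII."""
    # Split accented characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII (for example Ð) becomes "?".
    return stripped.encode("ascii", "replace").decode("ascii")


if __name__ == "__main__":
    print(to_ascii("Crème brûlée"))
```

    The conversion cannot be reversed, which fits the reporting-only use mentioned in post #5.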

  6. #6

    This works GREAT! Thank you.

    Now, I need to do something similar. I have corrupted Unicode codes in my data (e.g. \xd2, \xe4) and I need to replace them with their actual character values (Ò, ä). I am not very good with coding and I tried to edit the code for the brg you provided, but I keep breaking it. Any assistance you can provide would be much appreciated.

  7. #7
    Lavastorm Employee
    Join Date
    Nov 2012
    Location
    Warrington, UK
    Posts
    226

    Can you please clarify which brg you were trying to edit?

    However, I'm not sure the previous approaches would work in this case if your data contains the literal strings "\xd2", "\xe4" etc.; for example, the string "The ball is \xd2scar's" should be "The ball is Òscar's".
    This is because each escape is a literal string four characters long rather than a single character. You could use the replace operator to change the matching escaped strings to the corresponding characters, for example:

    Code:
    new_Data = Data.unicode().replace("\\xd2","Ò").replace("\\xe4","ä")
    
    emit *, new_Data
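
    If there are many different escapes, chaining one replace per code becomes unwieldy. A possible alternative, sketched here in Python rather than in the node's own scripting language, is to match every literal \xNN sequence with a regular expression and convert its hex value to the corresponding character. This assumes each escape is exactly two hex digits naming a Unicode code point (true for \xd2 and \xe4); the name unescape_hex is hypothetical.

```python
import re

# Matches a literal backslash, "x", and exactly two hex digits.
ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})")


def unescape_hex(text):
    """Replace each literal \\xNN sequence with the character it encodes."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


if __name__ == "__main__":
    # The raw string holds a real backslash followed by "xd2".
    print(unescape_hex(r"The ball is \xd2scar's"))
```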
