Thread: How to parse XML file using XMLpy node

  1. #1

    How to parse XML file using XMLpy node

    Hi,

    I am having trouble parsing an XML file using the XMLpy node. What do I need to write in the "FilenameExpr" and "Code" boxes?

    BR//
    Kumar Ashish

  2. #2
    Lavastorm Employee | Join Date: Aug 2009 | Location: Cologne | Posts: 513

    Hey,

    Either the File or the FilenameExpr parameter should be set, but not both.
    If you are simply reading from one file, then enter the filename in the File parameter.
    If you want to read from multiple files, then you can provide the XMLpy File node with an input. One of the fields in the input should then contain the names of the files which are to be read (one filename per record). The FilenameExpr parameter should then be used to reference the field in the input which contains the filenames.

    This is a standard interface, used by most of the nodes in the Lavastorm->Acquisition library group & category (e.g. Delimited File, CSV File, Fixed Format File, Excel File).
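
To make the "one filename per record" idea concrete, here is a rough stand-alone sketch in plain Python. It does not use the node or its API: the field name "Filename" and the helper read_root_tags are made up for illustration, and the two small files are created on the fly just so the example runs.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def read_root_tags(records, filename_field):
    """Parse the file named in each record; return (file name, root tag) pairs."""
    results = []
    for record in records:
        # Hypothetical analogue of FilenameExpr: pick the field holding the path
        path = record[filename_field]
        results.append((os.path.basename(path), ET.parse(path).getroot().tag))
    return results


with tempfile.TemporaryDirectory() as folder:
    records = []
    for name in ("a.xml", "b.xml"):
        path = os.path.join(folder, name)
        with open(path, "w") as handle:
            handle.write("<BATCH><DATA/></BATCH>")
        # One input record per file, with the path stored in the "Filename" field
        records.append({"Filename": path})
    parsed = read_root_tags(records, "Filename")

print(parsed)  # [('a.xml', 'BATCH'), ('b.xml', 'BATCH')]
```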

    Regarding the Code parameter, this is where you need to write how the XML data is to be extracted from the XML file & transformed into record outputs.

    The node is documented, so you should first check the node documentation. The documentation for the parameter will appear if you hover your mouse over the parameter name. The full node documentation can be found if you select the node, and hit F1 - alternatively you can select the node and then click Help->Help On Selected Node.

    If you are still stuck, there is also a tutorial available for the XMLpy File node, which provides a bit more detail on how to use the node, including more examples.

    You can bring up this tutorial by clicking Help->Tutorials.

    Unfortunately, I can't do more at this stage than point at the help or the tutorials, because I don't know the format of the file or what exactly you are trying to read from it.

    Tim.

  3. #3

    Hey,

    I am having some trouble reading data from complicated XML files. The problem is that I have data at both row and header level, and the field names are identical except that their parent elements carry a different "Type" attribute. Here's a short example:

    <BATCH>
    <DATA>
    <DATA1 Type="Rows">
    <WANTEDFIELD>99.00</WANTEDFIELD>
    </DATA1>
    <DATA1 Type="Header">
    <WANTEDFIELD>100.00</WANTEDFIELD>
    </DATA1>
    </DATA>
    </BATCH>

    Now, I'm interested in inputting the Header level data from the field "WANTEDFIELD" but I cannot do this at the moment. Could you help me somehow with this problem? Unfortunately it's not possible for me to edit the original xml-files either as there are thousands of them.

    I have understood that it would be possible to input the Type-phrase "Header" or "Rows" with the following parts of code: "/BATCH/DATA/DATA1/@Type"; "TypeHandler(attr)" and "attr.nodeValue" but I'm guessing that is not the solution in this case as I'm not interested in the text "Header" itself. I would just need to get the data from the Header-version of the "WANTEDFIELD".

    Thank you so much in advance for any help you can provide me!

    Best regards,
    Heikki

  4. #4
    Lavastorm Employee | Join Date: Sep 2009 | Location: Boston, Massachusetts, USA | Posts: 50

    Dear Heikki,

    I am having a little bit of trouble understanding your post so I'm sorry if I am solving the wrong problem. However, in the interest of helping you along as quickly as possible, I'll take a guess at what you are doing and see if that helps. If not, please write me back with some more details, and I'll respond as best I can.

    So, it sounds like you have an input like the following:

    Code:
    <BATCH>
    	<DATA>
    		<DATA1 Type="Rows">
    			<WANTEDFIELD>A</WANTEDFIELD>
    		</DATA1>
    		<DATA1 Type="Header">
    			<WANTEDFIELD>B</WANTEDFIELD>
    		</DATA1>
    	</DATA>
    	<DATA>
    		<DATA1 Type="Rows">
    			<WANTEDFIELD>C</WANTEDFIELD>
    		</DATA1>
    		<DATA1 Type="Header">
    			<WANTEDFIELD>D</WANTEDFIELD>
    		</DATA1>
    	</DATA>
    </BATCH>
    It sounds like you do not want this output:

    Rows Header
    A B
    C D

    Rather, you want this output:

    Header/WantedField
    B
    D

    You can do this by simply reading in every DATA1 tag and only outputting the ones that you want. The following code demonstrates this:

    Code:
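    data = {}
    
    # The handler is called for the DATA1 elements found at this path
    @elementHandler('/BATCH/DATA/DATA1')
    def processor(element):
    	# Output a record only when the Type attribute is "Header"
    	if element.Type=="Header":
    		data['WANTEDFIELD'] = str(element.WANTEDFIELD)
    		outputRecord(data)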
    Tell me how that works for you! I can provide more information if that's not exactly what you need.
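
If you want to check the expected result outside the node, the same selection can be sketched with Python's standard library. This is only a rough illustration under the assumption that the file looks like the sample above; it uses xml.etree.ElementTree rather than the XMLpy node's own handler API, and it filters on the Type attribute in the path instead of in an if statement.

```python
import xml.etree.ElementTree as ET

# Sample input copied from the example above
XML = """<BATCH>
  <DATA>
    <DATA1 Type="Rows"><WANTEDFIELD>A</WANTEDFIELD></DATA1>
    <DATA1 Type="Header"><WANTEDFIELD>B</WANTEDFIELD></DATA1>
  </DATA>
  <DATA>
    <DATA1 Type="Rows"><WANTEDFIELD>C</WANTEDFIELD></DATA1>
    <DATA1 Type="Header"><WANTEDFIELD>D</WANTEDFIELD></DATA1>
  </DATA>
</BATCH>"""

root = ET.fromstring(XML)

# The attribute is used as a filter, not as an output value:
# only WANTEDFIELD elements under a DATA1 with Type="Header" are selected.
header_values = [
    field.text
    for field in root.findall("./DATA/DATA1[@Type='Header']/WANTEDFIELD")
]

print(header_values)  # ['B', 'D']
```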

    Rocco
    Last edited by rpigneri; 05-18-2012 at 10:03 PM. Reason: Cleaned up a Small Typo

  5. #5

    Hi Rocco,

    Thank you very much for your reply, you understood me correctly and I got the script working now!

    I would still like to ask: how can I output a mandatory field and an optional field from XML files when the optional field is not present in all of the files that I'm trying to import? Let's say I'm trying to get two fields, WantedField and OptionalField, but OptionalField is not present in all of the input files. The output could be similar to what you presented above: one column for the Header/WantedField values and another column for the OptionalField values. If there's no optional field present on a specific row, the value could be null or N/A or something like that. I've tried this with a few different types of if statements but I haven't been able to get it working. I'm guessing this would need to be done with hasattr or something like that? What would be the simplest way to do this?

    Thanks once again very much in advance!

    Heikki

  6. #6
    Lavastorm Employee | Join Date: Sep 2009 | Location: Boston, Massachusetts, USA | Posts: 50

    Dear Heikki,

    You are completely right that you need to use hasattr. The trick, however, is that you also need to make a call to the setMetadata function to ensure there is space in the output for the optional fields as well as the mandatory fields.

    For example, let's say that you have this input XML file:

    Code:
    <BATCH>
    	<DATA>
    		<DATA1 Type="Rows">
    			<WANTEDFIELD>A</WANTEDFIELD>
    		</DATA1>	
    		<DATA1 Type="Header">
    			<WANTEDFIELD>B</WANTEDFIELD>
    		</DATA1>	
    	</DATA>
    	<DATA>
    		<DATA1 Type="Rows">
    			<WANTEDFIELD>C</WANTEDFIELD>
    		</DATA1>	
    		<DATA1 Type="Header">
    			<WANTEDFIELD>D</WANTEDFIELD>
    			<OPTIONALFIELD>1</OPTIONALFIELD>
    		</DATA1>	
    	</DATA>
    </BATCH>
    In this case, our first output row will not have an OPTIONALFIELD; however, the second output row will. If we were to set the output metadata of this node based solely on the first output row, then we wouldn't be able to output the OPTIONALFIELD when we got to the second "Header" tag.

    The following code will parse this XML file correctly:

    Code:
    data = {}
    
    def Initialize():
    	metadata = {}
    	metadata['WANTEDFIELD'] = "String"
    	metadata['OPTIONALFIELD'] = 1
    	setMetadata(metadata)
    
    @elementHandler('/BATCH/DATA/DATA1')
    def processor(element):
    	if element.Type=="Header":
    		data['WANTEDFIELD'] = element.WANTEDFIELD
    		if hasattr(element, 'OPTIONALFIELD'):
    			data['OPTIONALFIELD'] = str(element.OPTIONALFIELD)
    		outputRecord(data)
    You'll notice that the Initialize function is new to this example. This function runs before the XML file is parsed and uses the setMetadata function to declare the WANTEDFIELD and OPTIONALFIELD output fields. You will also notice that the values passed to setMetadata are not actual output but rather dummy values. The setMetadata function simply uses the type of these values to set the output metadata for the node. In this case, WANTEDFIELD is a string while OPTIONALFIELD is an integer.

    You'll also notice the somewhat strange cast in the line "data['OPTIONALFIELD'] = str(element.OPTIONALFIELD)". This is due to a small bug in the underlying Amara XML engine. If you leave it out, you will get a somewhat mystifying error.
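
To illustrate the idea behind the dummy values outside the node, here is a minimal sketch in plain Python. It is not how the node is implemented: schema_from_samples and conform are hypothetical helpers, and the only assumption carried over from the description above is that just the type of each dummy value matters.

```python
def schema_from_samples(samples):
    """Derive a {field name: type} schema from dummy values (hypothetical helper)."""
    return {name: type(value) for name, value in samples.items()}


def conform(record, schema):
    """Return a row holding every declared field; absent fields become None."""
    row = {}
    for name, field_type in schema.items():
        value = record.get(name)
        row[name] = None if value is None else field_type(value)
    return row


# Same dummy values as in the Initialize function: a string and an integer
schema = schema_from_samples({"WANTEDFIELD": "String", "OPTIONALFIELD": 1})

# Two "Header" records: the first has no optional value, the second does
records = [{"WANTEDFIELD": "B"}, {"WANTEDFIELD": "D", "OPTIONALFIELD": "1"}]
rows = [conform(record, schema) for record in records]

print(rows)
# [{'WANTEDFIELD': 'B', 'OPTIONALFIELD': None}, {'WANTEDFIELD': 'D', 'OPTIONALFIELD': 1}]
```

Because the fields are declared before any record is seen, a record without the optional value still gets a column for it.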

    Hope that helps!

    Rocco

  7. #7

    Hi Rocco,

    Thanks once again for your very detailed reply! I learned a lot more and I've now been able to read all the data I want so that's good.
    However, as the case I was working with was a little more complex (it had several fields in it) than the illustrative example, I ended up doing this task with the help of a few separate XMLpy nodes. I did this because I had some errors with the XMLpy node outputting only NULLs, or duplicating values of the optional field into rows where there originally wasn't any value in this optional field. Originally there were 107 files containing optional field data, but a node with the above-described code duplicated certain values to almost 8 000 files. Just out of curiosity, has anyone noticed similar behavior while using the XMLpy node?

    I hope you have a nice week, you all!

    Kind regards,
    Heikki

  8. #8
    Lavastorm Employee | Join Date: Sep 2009 | Location: Boston, Massachusetts, USA | Posts: 50

    Dear Heikki,

    It sounds like you have been experiencing two different but related problems with optional fields. The first is that optional values are output for XML snippets that do not have these values. The second is that you get completely null fields when there are no optional values in the XML at all.

    Let's deal with the first situation first. Let's say that you have this input data:

    Code:
    <BATCH>
    	<DATA>
    		<DATA1 Type="Rows">
    			<WANTEDFIELD>A</WANTEDFIELD>
    		</DATA1>	
    		<DATA1 Type="Header">
    			<WANTEDFIELD>B</WANTEDFIELD>
    			<OPTIONALFIELD>1</OPTIONALFIELD>
    		</DATA1>	
    	</DATA>
    	<DATA>
    		<DATA1 Type="Rows">
    			<WANTEDFIELD>C</WANTEDFIELD>
    		</DATA1>	
    		<DATA1 Type="Header">
    			<WANTEDFIELD>D</WANTEDFIELD>
    		</DATA1>	
    	</DATA>
    	<DATA>
    		<DATA1 Type="Rows">
    			<WANTEDFIELD>E</WANTEDFIELD>
    		</DATA1>	
    		<DATA1 Type="Header">
    			<WANTEDFIELD>F</WANTEDFIELD>
    			<OPTIONALFIELD>2</OPTIONALFIELD>
    		</DATA1>	
    	</DATA>
    </BATCH>
    With the code that I last provided, you would receive this output:

    Wanted Optional
    B 1
    D 1
    F 2

    However, what you really want is this output.

    Wanted Optional
    B 1
    D NULL
    F 2

    The following code will solve this problem very easily. The only difference between this solution and the last one is the position of the "data = {}" line:

    Code:
    def Initialize():
    	metadata = {}
    	metadata['WANTEDFIELD'] = "String"
    	metadata['OPTIONALFIELD'] = 1
    	setMetadata(metadata)
    
    @elementHandler('/BATCH/DATA/DATA1')
    def processor(element):
    	# Start from an empty record on every call so no old values carry over
    	data = {}
    	if element.Type=="Header":
    		data['WANTEDFIELD'] = element.WANTEDFIELD
    		if hasattr(element, 'OPTIONALFIELD'):
    			data['OPTIONALFIELD'] = str(element.OPTIONALFIELD)
    		outputRecord(data)
    You'll notice that I simply moved the "data = {}" line from outside the processor function to inside it. This change has the node clear out the data output structure each time we find a DATA1 element. That way, information from the previous call to the handler will not appear in the next one.
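
The effect is easy to reproduce outside the node. The sketch below is a plain Python illustration using xml.etree.ElementTree, not the XMLpy handler API; parse_headers and its reuse_dict switch are made up for the comparison. With one dictionary shared across all elements, the optional value from B leaks into D; with a fresh dictionary per element, it does not.

```python
import xml.etree.ElementTree as ET

# Sample input copied from the example above
XML = """<BATCH>
  <DATA>
    <DATA1 Type="Rows"><WANTEDFIELD>A</WANTEDFIELD></DATA1>
    <DATA1 Type="Header"><WANTEDFIELD>B</WANTEDFIELD><OPTIONALFIELD>1</OPTIONALFIELD></DATA1>
  </DATA>
  <DATA>
    <DATA1 Type="Rows"><WANTEDFIELD>C</WANTEDFIELD></DATA1>
    <DATA1 Type="Header"><WANTEDFIELD>D</WANTEDFIELD></DATA1>
  </DATA>
  <DATA>
    <DATA1 Type="Rows"><WANTEDFIELD>E</WANTEDFIELD></DATA1>
    <DATA1 Type="Header"><WANTEDFIELD>F</WANTEDFIELD><OPTIONALFIELD>2</OPTIONALFIELD></DATA1>
  </DATA>
</BATCH>"""


def parse_headers(xml_text, reuse_dict):
    """Return (WANTEDFIELD, OPTIONALFIELD) pairs for every Header element."""
    shared = {}  # plays the role of a module-level "data = {}"
    rows = []
    for element in ET.fromstring(xml_text).iterfind("./DATA/DATA1"):
        if element.get("Type") != "Header":
            continue
        # Either keep writing into the same dict or start with an empty one
        data = shared if reuse_dict else {}
        data["WANTEDFIELD"] = element.findtext("WANTEDFIELD")
        optional = element.findtext("OPTIONALFIELD")
        if optional is not None:
            data["OPTIONALFIELD"] = optional
        rows.append((data["WANTEDFIELD"], data.get("OPTIONALFIELD")))
    return rows


stale = parse_headers(XML, reuse_dict=True)
fresh = parse_headers(XML, reuse_dict=False)

print(stale)  # [('B', '1'), ('D', '1'), ('F', '2')]
print(fresh)  # [('B', '1'), ('D', None), ('F', '2')]
```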

    As for situations where there are no optional values at all, it is easiest simply to output a field of all nulls, since this field has to be declared in the metadata in case there are any output values. In addition, nulls take up very little space on disk, so there is only a minimal performance hit even if you have a lot of optional fields. It is possible to build an LAE framework that could distinguish between these two cases and run one of two XMLpy File nodes, but that would be very complicated for only a little benefit.

    Hope that helps!

    Rocco
