How to Structure Your Text and XML Data¶

Data management is about structure and not format!¶

If your data follows well thought out rules it will be easier for others to use. Don't worry too much over which seperator you choose (comma or double space), worry more that the seperator meaning is preserved through out your data. Similarly, you might believe you have found the perfect data format (Ascii, XML, JSON, .mat) that meets everyone's needs, but in any scientific eniviornment it is almost guarenteed that the data will have to be put in another format at some point. Once the data format is set, write down a description (schema) somewhere. This is tedious but well worth it in the long run.

Name It Well¶

Avoid Using Spaces as Delimiters in Names (No: My File.txt, Yes: MyFile.txt or My_File.txt)
Put Date Codes in ISO Format (20161116 not 11162016)
Choose a scheme with rules and stick to it, mine is Capitol Snake case with date code and iterator, the name starting with the most specific description and moving to the general followed by date (YYYYMMDD) and then a iterator, padded. An example is Two_Port_SParameters_20161116_001.txt
Use extensions, choose general extensions like txt or xml over none. For example MyData.txt is better than MyData, at least .txt tells you it is ascii encoded.
If you are using ascii, choose characters codes in the 0-126 range if at all possible. If you want to move to expanded character sets use utf-8 (same as ascii for 7bits) or html or something like that.

The Header¶

The header should contain a series of facts that are true about the measurement or file, if it changes for each measurement then put it in the data.
Long text descriptions do belong in the header, just group it all in one place, and name it Notes or Comments
To make the header more machine friendly name the facts. Best is to name the facts and the context.
The best piece of metadata is always a timestamp.

The Data¶

If the data can be written in a table with columns that are all the same type, do it that way.
Provide column names some place close to the columns, preferably the line above

An example of a ill-formed header
``` #9/30/2015 5:10 PM DMM reading = 2.809244E+1; IFBW= 3.000000E+2 (Hz); Power= -2.000000E+1 (dBm); #Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db ```

How it should look
``` #Measurement_Timestamp = 2016-11-16 21:07:47.252000 ; DMM_reading = 2.809244E+1; IFBW = 3.000000E+2; #IFBW.units = Hz ; Power = -2.000000E+1; Power.units = dBm; #Comments = Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db ```

Also good
``` !Measurement_Timestamp = 2016-11-16 21:07:47.252000 !DMM_reading = 2.809244E+1 !IFBW = 3.000000E+2 !IFBW.units = Hz !Power = -2.000000E+1 !Power.units = dBm !Comments = Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db ```

Notes:¶

It is better to use = instead of :, because timestamps frequently have : in them
Better to avoid spaces in parameter names but not a must

Column Names¶

Column names should be descriptive without spaces if possible, they should have a uniform delimiter

An example of poorly formed column names
``` ! Freq ( Hz) Real Part of S11 Imaginary Part of S11 ```

How it should look
``` !Frequency, reS11, imS11 ```

Notes:¶

Put units in the header like column_units=[Hz, None, None] or Frequency_Units=Hz
Better to avoid spaces in column names but not a must, just don't use the delimiter in the column name

Data (Tabular)¶

Make the data a stand-alone section
Avoid comments in data
Use the same character to delimit the columns
Make the columns mean a single thing and all rows the same length
Truncate the numbers, don't write them as full double precision (causes issues down the road)

An example of poorly formed data
``` 1 -.01 .01 1.11 .014 .015000101010013038 1.12 .015 ```

How it should look
``` 1 -.012 .012 1.11 .014 .015 1.12 .015 .031 ```

Putting these things together¶

# here is a code example
from pyMez import *
header=['Measurement_Timestamp = 2016-11-16 21:07:47.252000', 
'DMM_reading = 2.809244E+1',
'IFBW = 3.000000E+2', 
'IFBW.units = Hz',
'Power = -2.000000E+1',
'Power.units = dBm',
'column_units = [GHz,None,None]',        
'Comments = Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db']
column_names=['Frequency', 'reS11', 'imS11']
data=[[1,-.012, .012],
[1.11,  .014, .015],
[1.12 , .015, .031]]
table=AsciiDataTable(None,header=header,column_names=column_names,
                     data=data,comment_begin='!',comment_end='\n',column_names_delimiter=',',
                    column_names_begin_token='!',column_names_end_token='\n',data_delimiter='\t',
                     treat_header_as_comment=True,data_table_element_separator=None,
                     specific_descriptor="Example")

print("The table name is {0} \n".format(table.path))
print("*"*80+"\n")
print table

The table name is Example_Table_20161117_001.txt 

********************************************************************************

!Measurement_Timestamp = 2016-11-16 21:07:47.252000
!DMM_reading = 2.809244E+1
!IFBW = 3.000000E+2
!IFBW.units = Hz
!Power = -2.000000E+1
!Power.units = dBm
!column_units = [GHz,None,None]
!Comments = Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db
!Frequency,reS11,imS11
1	-0.012	0.012
1.11	0.014	0.015
1.12	0.015	0.031

The exact choices matter less than the pattern¶

For example this is equally as good

table.options["data_delimiter"]=","
table.options["comment_begin"]="#"
table.path=table.path.replace("Example","One_Port_SParameter")
print("The table name is {0} \n".format(table.path))
print("*"*80+"\n")
print table

The table name is One_Port_SParameter_Table_20161117_001.txt 

********************************************************************************

#Measurement_Timestamp = 2016-11-16 21:07:47.252000
#DMM_reading = 2.809244E+1
#IFBW = 3.000000E+2
#IFBW.units = Hz
#Power = -2.000000E+1
#Power.units = dBm
#column_units = [GHz,None,None]
#Comments = Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db
!Frequency,reS11,imS11
1,-0.012,0.012
1.11,0.014,0.015
1.12,0.015,0.031

In fact XML with a good pattern is great too!¶

xml_table=AsciiDataTable_to_XMLDataTable(table)
print("The table name is {0} \n".format(xml_table.path))
print("*"*80+"\n")
print xml_table

The table name is Example_Table_20161117_001.xml 

********************************************************************************

<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="../XSL/ONE_PORT_STYLE.xsl"?>
<Data_Table>
	<Data_Description>
		<Header_004>Power = -2.000000E+1</Header_004>
		<Header_005>Power.units = dBm</Header_005>
		<Header_006>column_units = [GHz,None,None]</Header_006>
		<Header_007>Comments = Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db</Header_007>
		<Header_000>Measurement_Timestamp = 2016-11-16 21:07:47.252000</Header_000>
		<Header_001>DMM_reading = 2.809244E+1</Header_001>
		<Header_002>IFBW = 3.000000E+2</Header_002>
		<Header_003>IFBW.units = Hz</Header_003>
	</Data_Description>
	<Data>
		<Tuple Frequency="1" imS11="0.012" reS11="-0.012"/>
		<Tuple Frequency="1.11" imS11="0.015" reS11="0.014"/>
		<Tuple Frequency="1.12" imS11="0.031" reS11="0.015"/>
	</Data>
</Data_Table>

Or a little more descriptive version¶

data={"Data_Description":structure_metadata(string_list_collapse(header)),
      "Data":table.get_data_dictionary_list()}
xml_table_2=DataTable(None,data_dictionary=data)
print xml_table_2

<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="../XSL/DEFAULT_MEASUREMENT_STYLE.xsl"?>
<Data_Table>
	<Data_Description>
		<Measurement_Timestamp>2016-11-16 21:07:47.252000</Measurement_Timestamp>
		<column_units>[GHz,None,None]</column_units>
		<Power>-2.000000E+1</Power>
		<DMM_reading>2.809244E+1</DMM_reading>
		<IFBW>3.000000E+2</IFBW>
		<IFBW_units>Hz</IFBW_units>
		<Comments>Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db</Comments>
		<Power_units>dBm</Power_units>
	</Data_Description>
	<Data>
		<Tuple Frequency="1" imS11="0.012" reS11="-0.012"/>
		<Tuple Frequency="1.11" imS11="0.015" reS11="0.014"/>
		<Tuple Frequency="1.12" imS11="0.031" reS11="0.015"/>
	</Data>
</Data_Table>

And this gives us a good deal of formating options also¶

from IPython.display import HTML,display
display(HTML(xml_table_2.to_HTML(os.path.join(TESTS_DIRECTORY,"../XSL/DEFAULT_MEASUREMENT_STYLE.xsl"))))

Some development work for those who are interseted in seeing how a header like this would be parsed¶

string="""
#Measurement_Timestamp = 2016-11-16 21:07:47.252000 ; DMM_Reading = 2.809244E+1; IFBW = 3.000000E+2; 
#IFBW.units = Hz ; Power = -2.000000E+1;Power.units = dBm;
#Comments = Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db"""

def structure_metadata(header_string,metadata_fact_delimiter=";",metadata_key_value_delimiter="=",comment_character="#"):
    """Strucutre Metadata returns a metadata string and returns a metadata dictionary"""
    string_list=re.split(metadata_fact_delimiter+'|\n',header_string.replace(comment_character,''))
    metadata_dictionary={}
    for string in string_list:
        pair=string.split(metadata_key_value_delimiter)
        #print pair
        #print len(pair)
        if len(pair)==2:
            key=pair[0].rstrip().lstrip().replace(".","_")
            value=pair[1].rstrip().lstrip()
            metadata_dictionary[key]=value
    return metadata_dictionary

print structure_metadata(string)

{'Measurement_Timestamp': '2016-11-16 21:07:47.252000', 'Power': '-2.000000E+1', 'DMM_Reading': '2.809244E+1', 'IFBW': '3.000000E+2', 'IFBW_units': 'Hz', 'Comments': 'Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db', 'Power_units': 'dBm'}

How to Structure Your Text and XML Data¶

Data management is about structure and not format!¶

Name It Well¶

The Header¶

The Data¶

Notes:¶

Column Names¶

Notes:¶

Data (Tabular)¶

Putting these things together¶

The exact choices matter less than the pattern¶

In fact XML with a good pattern is great too!¶

Or a little more descriptive version¶

And this gives us a good deal of formating options also¶

Data Description:

Data:

Some development work for those who are interseted in seeing how a header like this would be parsed¶

Measurement_Timestamp :	2016-11-16 21:07:47.252000
column_units :	[GHz,None,None]
Power :	-2.000000E+1
DMM_reading :	2.809244E+1
IFBW :	3.000000E+2
IFBW_units :	Hz
Comments :	Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db
Power_units :	dBm