How to Structure Your Text and XML Data

Data management is about structure and not format!

If your data follows well thought out rules it will be easier for others to use. Don't worry too much over which seperator you choose (comma or double space), worry more that the seperator meaning is preserved through out your data. Similarly, you might believe you have found the perfect data format (Ascii, XML, JSON, .mat) that meets everyone's needs, but in any scientific eniviornment it is almost guarenteed that the data will have to be put in another format at some point. Once the data format is set, write down a description (schema) somewhere. This is tedious but well worth it in the long run.

Name It Well

  1. Avoid Using Spaces as Delimiters in Names (No: My File.txt, Yes: MyFile.txt or My_File.txt)
  2. Put Date Codes in ISO Format (20161116 not 11162016)
  3. Choose a scheme with rules and stick to it, mine is Capitol Snake case with date code and iterator, the name starting with the most specific description and moving to the general followed by date (YYYYMMDD) and then a iterator, padded. An example is Two_Port_SParameters_20161116_001.txt
  4. Use extensions, choose general extensions like txt or xml over none. For example MyData.txt is better than MyData, at least .txt tells you it is ascii encoded.
  5. If you are using ascii, choose characters codes in the 0-126 range if at all possible. If you want to move to expanded character sets use utf-8 (same as ascii for 7bits) or html or something like that.

The Header

  1. The header should contain a series of facts that are true about the measurement or file, if it changes for each measurement then put it in the data.
  2. Long text descriptions do belong in the header, just group it all in one place, and name it Notes or Comments
  3. To make the header more machine friendly name the facts. Best is to name the facts and the context.
  4. The best piece of metadata is always a timestamp.

The Data

  1. If the data can be written in a table with columns that are all the same type, do it that way.
  2. Provide column names some place close to the columns, preferably the line above
An example of a ill-formed header
``` #9/30/2015 5:10 PM DMM reading = 2.809244E+1; IFBW= 3.000000E+2 (Hz); Power= -2.000000E+1 (dBm); #Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db ```
How it should look
``` #Measurement_Timestamp = 2016-11-16 21:07:47.252000 ; DMM_reading = 2.809244E+1; IFBW = 3.000000E+2; #IFBW.units = Hz ; Power = -2.000000E+1; Power.units = dBm; #Comments = Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db ```
Also good
``` !Measurement_Timestamp = 2016-11-16 21:07:47.252000 !DMM_reading = 2.809244E+1 !IFBW = 3.000000E+2 !IFBW.units = Hz !Power = -2.000000E+1 !Power.units = dBm !Comments = Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db ```

Notes:

  1. It is better to use = instead of :, because timestamps frequently have : in them
  2. Better to avoid spaces in parameter names but not a must

Column Names

  1. Column names should be descriptive without spaces if possible, they should have a uniform delimiter
An example of poorly formed column names
``` ! Freq ( Hz) Real Part of S11 Imaginary Part of S11 ```
How it should look
``` !Frequency, reS11, imS11 ```

Notes:

  1. Put units in the header like column_units=[Hz, None, None] or Frequency_Units=Hz
  2. Better to avoid spaces in column names but not a must, just don't use the delimiter in the column name

Data (Tabular)

  1. Make the data a stand-alone section
  2. Avoid comments in data
  3. Use the same character to delimit the columns
  4. Make the columns mean a single thing and all rows the same length
  5. Truncate the numbers, don't write them as full double precision (causes issues down the road)
An example of poorly formed data
``` 1 -.01 .01 1.11 .014 .015000101010013038 1.12 .015 ```
How it should look
``` 1 -.012 .012 1.11 .014 .015 1.12 .015 .031 ```

Putting these things together

In [138]:
# here is a code example
from pyMez import *
header=['Measurement_Timestamp = 2016-11-16 21:07:47.252000', 
'DMM_reading = 2.809244E+1',
'IFBW = 3.000000E+2', 
'IFBW.units = Hz',
'Power = -2.000000E+1',
'Power.units = dBm',
'column_units = [GHz,None,None]',        
'Comments = Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db']
column_names=['Frequency', 'reS11', 'imS11']
data=[[1,-.012, .012],
[1.11,  .014, .015],
[1.12 , .015, .031]]
table=AsciiDataTable(None,header=header,column_names=column_names,
                     data=data,comment_begin='!',comment_end='\n',column_names_delimiter=',',
                    column_names_begin_token='!',column_names_end_token='\n',data_delimiter='\t',
                     treat_header_as_comment=True,data_table_element_separator=None,
                     specific_descriptor="Example")
In [139]:
print("The table name is {0} \n".format(table.path))
print("*"*80+"\n")
print table
The table name is Example_Table_20161117_001.txt 

********************************************************************************

!Measurement_Timestamp = 2016-11-16 21:07:47.252000
!DMM_reading = 2.809244E+1
!IFBW = 3.000000E+2
!IFBW.units = Hz
!Power = -2.000000E+1
!Power.units = dBm
!column_units = [GHz,None,None]
!Comments = Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db
!Frequency,reS11,imS11
1	-0.012	0.012
1.11	0.014	0.015
1.12	0.015	0.031

The exact choices matter less than the pattern

For example this is equally as good

In [140]:
table.options["data_delimiter"]=","
table.options["comment_begin"]="#"
table.path=table.path.replace("Example","One_Port_SParameter")
print("The table name is {0} \n".format(table.path))
print("*"*80+"\n")
print table
The table name is One_Port_SParameter_Table_20161117_001.txt 

********************************************************************************

#Measurement_Timestamp = 2016-11-16 21:07:47.252000
#DMM_reading = 2.809244E+1
#IFBW = 3.000000E+2
#IFBW.units = Hz
#Power = -2.000000E+1
#Power.units = dBm
#column_units = [GHz,None,None]
#Comments = Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db
!Frequency,reS11,imS11
1,-0.012,0.012
1.11,0.014,0.015
1.12,0.015,0.031

In fact XML with a good pattern is great too!

In [141]:
xml_table=AsciiDataTable_to_XMLDataTable(table)
print("The table name is {0} \n".format(xml_table.path))
print("*"*80+"\n")
print xml_table
The table name is Example_Table_20161117_001.xml 

********************************************************************************

<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="../XSL/ONE_PORT_STYLE.xsl"?>
<Data_Table>
	<Data_Description>
		<Header_004>Power = -2.000000E+1</Header_004>
		<Header_005>Power.units = dBm</Header_005>
		<Header_006>column_units = [GHz,None,None]</Header_006>
		<Header_007>Comments = Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db</Header_007>
		<Header_000>Measurement_Timestamp = 2016-11-16 21:07:47.252000</Header_000>
		<Header_001>DMM_reading = 2.809244E+1</Header_001>
		<Header_002>IFBW = 3.000000E+2</Header_002>
		<Header_003>IFBW.units = Hz</Header_003>
	</Data_Description>
	<Data>
		<Tuple Frequency="1" imS11="0.012" reS11="-0.012"/>
		<Tuple Frequency="1.11" imS11="0.015" reS11="0.014"/>
		<Tuple Frequency="1.12" imS11="0.031" reS11="0.015"/>
	</Data>
</Data_Table>

Or a little more descriptive version

In [147]:
data={"Data_Description":structure_metadata(string_list_collapse(header)),
      "Data":table.get_data_dictionary_list()}
xml_table_2=DataTable(None,data_dictionary=data)
print xml_table_2
<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="../XSL/DEFAULT_MEASUREMENT_STYLE.xsl"?>
<Data_Table>
	<Data_Description>
		<Measurement_Timestamp>2016-11-16 21:07:47.252000</Measurement_Timestamp>
		<column_units>[GHz,None,None]</column_units>
		<Power>-2.000000E+1</Power>
		<DMM_reading>2.809244E+1</DMM_reading>
		<IFBW>3.000000E+2</IFBW>
		<IFBW_units>Hz</IFBW_units>
		<Comments>Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db</Comments>
		<Power_units>dBm</Power_units>
	</Data_Description>
	<Data>
		<Tuple Frequency="1" imS11="0.012" reS11="-0.012"/>
		<Tuple Frequency="1.11" imS11="0.015" reS11="0.014"/>
		<Tuple Frequency="1.12" imS11="0.031" reS11="0.015"/>
	</Data>
</Data_Table>

And this gives us a good deal of formating options also

In [150]:
from IPython.display import HTML,display
display(HTML(xml_table_2.to_HTML(os.path.join(TESTS_DIRECTORY,"../XSL/DEFAULT_MEASUREMENT_STYLE.xsl"))))

Data Description:

Measurement_Timestamp : 2016-11-16 21:07:47.252000
column_units : [GHz,None,None]
Power : -2.000000E+1
DMM_reading : 2.809244E+1
IFBW : 3.000000E+2
IFBW_units : Hz
Comments : Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db
Power_units : dBm

Data:

Frequency imS11 reS11
1 0.012 -0.012
1.11 0.015 0.014
1.12 0.031 0.015

Some development work for those who are interseted in seeing how a header like this would be parsed

In [151]:
string="""
#Measurement_Timestamp = 2016-11-16 21:07:47.252000 ; DMM_Reading = 2.809244E+1; IFBW = 3.000000E+2; 
#IFBW.units = Hz ; Power = -2.000000E+1;Power.units = dBm;
#Comments = Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db"""
In [146]:
def structure_metadata(header_string,metadata_fact_delimiter=";",metadata_key_value_delimiter="=",comment_character="#"):
    """Strucutre Metadata returns a metadata string and returns a metadata dictionary"""
    string_list=re.split(metadata_fact_delimiter+'|\n',header_string.replace(comment_character,''))
    metadata_dictionary={}
    for string in string_list:
        pair=string.split(metadata_key_value_delimiter)
        #print pair
        #print len(pair)
        if len(pair)==2:
            key=pair[0].rstrip().lstrip().replace(".","_")
            value=pair[1].rstrip().lstrip()
            metadata_dictionary[key]=value
    return metadata_dictionary
In [152]:
print structure_metadata(string)
{'Measurement_Timestamp': '2016-11-16 21:07:47.252000', 'Power': '-2.000000E+1', 'DMM_Reading': '2.809244E+1', 'IFBW': '3.000000E+2', 'IFBW_units': 'Hz', 'Comments': 'Q26Ra_g5 and the Song wafer with the pdms roof and su8 microfluidics. NDO, 2015/II/pg111, -20db', 'Power_units': 'dBm'}