Parsing HTML or XML

Content:
a. Search whether any substring matches <specificWord
b. Search whether any substring matches </specificWord
c. Get all the words/terms between a whitespace (" ") and a symbol (in this case, the equals sign, "=").

==========

a. Purpose:
Search whether any substring matches <specificWord

● Python:
import re
term = "Machine"
reg = r"\<" + term
retval1 = re.search(reg, substring)
retval2 = retval1.group(0)
substring:
<Machine>
Return value1:
<_sre.SRE_Match object at 0x___>
Regex:
r"\<Machine"
Return value2:
<Machine
● Same code as above, with a different substring:
substring:
</Machine>
Return value1:
None
(retval1.group(0) cannot be called in this case, because retval1 is None.)
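
The None case above is easy to trip over, so here is a minimal sketch of the same search wrapped in a helper that guards against it. The function name find_open_tag is made up for this note, and re.escape is an addition (it is not in the original snippet) to keep terms with special characters from being read as regex syntax.

```python
import re

def find_open_tag(term, line):
    """Return the matched text (e.g. '<Machine') or None if there is no match."""
    # re.escape is an assumption added here; the original concatenates term directly.
    match = re.search(r"<" + re.escape(term), line)
    # Only call group(0) when a match object was actually returned.
    return match.group(0) if match else None

print(find_open_tag("Machine", "<Machine>"))
print(find_open_tag("Machine", "</Machine>"))
```

Note that this pattern is a prefix match, so it would also accept a longer tag name that starts with the same letters.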

b. Purpose:
Search whether any substring matches </specificWord

● Python:
import re
term = "Machine"
reg = r"\<\/" + term
retval1 = re.search(reg, substring)
#retval2 = retval1.group(0)
substring:
<Machine>
Return value1:
None
Regex:
r"\<\/Machine"
● Same code as above, with a different substring (and retval2 uncommented):
substring:
</Machine>
Return value1:
<_sre.SRE_Match object at 0x___>
Return value2:
</Machine
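
Cases a and b can be combined into one check. The following is only a sketch: classify_tag is a hypothetical helper name, and testing the closing pattern first is a choice made here so that a line containing both tags counts as a close.

```python
import re

def classify_tag(term, line):
    """Return 'close', 'open', or None depending on which tag form appears in line."""
    name = re.escape(term)  # assumption: escape the term before building the regex
    if re.search(r"</" + name, line):
        return "close"
    if re.search(r"<" + name, line):
        return "open"
    return None

for sample in ["<Machine>", "</Machine>", "<Other>"]:
    print(sample, classify_tag("Machine", sample))
```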

c. Purpose:
Get all the words/terms between a whitespace (" ") and a symbol (in this case, the equals sign, "=").

● Python:
import re
beg = r"\s"
en = "="
reg = beg + r"(.*?)" + en
retval1 = re.findall(reg, sLine, re.S)
retval2 = len(retval1)
retval3 = retval1[2]
substring (sLine; the line is indented in the file, so the first match also picks up the leading spaces and "<MC"):
<MC ACS="1" Adv="0" BrandID="4" CommonVar="100" ControllerID="4">
Return value1:
['   <MC ACS', 'Adv', 'BrandID', 'CommonVar', 'ControllerID']
Regex:
r"\s(.*?)="
Return value2:
5
Return value3:
BrandID
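
The first element of the result above still carries the tag name. A possible variation, shown here only as a sketch, matches word characters directly in front of =" instead of "anything after a whitespace". The pattern (\w+)=" is an alternative tried for this note, not the regex used in the table, and it assumes attribute names contain only letters, digits, and underscores.

```python
import re

line = '    <MC ACS="1" Adv="0" BrandID="4" CommonVar="100" ControllerID="4">'

# Attribute names: word characters immediately followed by ="
names = re.findall(r'(\w+)="', line)
# Attribute values: the shortest text between two double quotes
values = re.findall(r'"(.*?)"', line)
pairs = list(zip(names, values))

print(names)
print(pairs[2])
```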

Syntax personal notes (Python)

Content

NoteCode | Purpose | Syntax | Additional Information
AN001 | Dealing with Japanese characters/format [String encoding] | QtCore.QString.fromUtf8 | Library: PySide. Keyword: try, except, AttributeError, def, UTF-8 encoding, string encoding
AN002 | Using Japanese characters inside the source file (.py) [File encoding] | # -*- coding: utf-8 -*- | Keyword: file encoding, UTF-8
AN003 | Print without newline | from __future__ import print_function | Keyword: __future__, print statement, print() function, Python 3, Python 2.6, end

 

AN001
Purpose: Dealing with Japanese characters/format [String encoding]
Library: PySide (using QtCore)
Keyword: try, except, AttributeError, def, UTF-8 encoding, string encoding

What is UTF-8 encoding:
Unicode Transformation Format that uses 8-bit blocks (one to four bytes per character) to represent characters

try:
    _fromUtf8 = QtCore.QString.fromUtf8
except AttributeError:
    def _fromUtf8(s):
        return s

Code location: at the beginning (header)

AN002
Purpose: Using Japanese character inside the source file (.py) [File encoding]
Keyword: file encoding, UTF8

# -*- coding: utf-8 -*-

Code location: at the VERY beginning (header)

AN003
Purpose: Print without newline
Keyword: __future__ , print statement, print() function, Python 3, Python 2.6, end

  • In Python 2 (before Python 3):
    print is a statement, and the printed text always ends with a newline (or with a space, if the print statement ends with a comma ',')
  • From Python 3 on:
    print is a function, print(), instead of a statement, and the argument "end" can be used to specify what is written at the end of the printed text
  • How to use print() if the installed version is still Python 2 (2.6 and above):
    use the code
from __future__ import print_function
print("\t", end="")
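
To see what end does in practice, here is a small sketch written for Python 3 (on Python 2.6/2.7 the __future__ import above would have to be the first statement of the file, and the in-memory capture below would need unicode strings). The helper name print_row and its out parameter are made up for this example.

```python
import io
import sys

def print_row(cells, out=None):
    """Write the cells on one line, each followed by a tab, then end the line."""
    if out is None:
        out = sys.stdout
    for cell in cells:
        print(cell, end="\t", file=out)  # end="\t" replaces the default newline
    print(file=out)                      # a bare print() writes only the newline

print_row(["ACS", "Adv", "BrandID"])

# Capture the same output in memory to inspect exactly what was written.
buffer = io.StringIO()
print_row(["ACS", "Adv", "BrandID"], out=buffer)
captured = buffer.getvalue()
print(repr(captured))
```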


Libraries in Python

Useful modules, packages, and libraries in Python. This is a list of personal preferences (previously/currently used).
Some definitions are taken from here and here.

Purpose | Module name | Additional information
GUI | Tkinter | Traditional Python UI toolkit
GUI | PyQt | Qt framework
GUI | PySide | Qt framework; modules: QtGui, QtCore
Foreign function interface | ctypes | Package for calling functions of DLLs/shared libraries [included in the Standard Library since Python 2.5]
XML processing | ElementTree | [included in the Standard Library since Python 2.5]
Database | sqlite3 |
Operating system services | os | Miscellaneous OS interfaces [Standard Library]
Operating system services | time | Time access and conversions [Standard Library]
Text processing services | re | Regular expressions [Standard Library]
Numeric & mathematical modules | math | Mathematical functions [Standard Library]
Data types | copy | Shallow & deep copy operations [Standard Library]
Structured markup processing tools | xml.dom | XML processing, Document Object Model (DOM) API [Standard Library]
Structured markup processing tools | xml.dom.minidom | Minimal DOM implementation [Standard Library]
Binary data services | codecs | Codec registry and base classes [Standard Library]
3D-rendering related | pivy | Python binding library for Coin3D (3D-rendering library)
Read/write office files | openpyxl | Reads/writes Excel 2010 files [available from the Python Package Index]
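
As a quick illustration of one entry in the list, this sketch reads attributes with ElementTree instead of regular expressions. The XML snippet is invented for the example (it only reuses attribute names from the regex notes), so treat the element names as assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from a real file.
xml_text = '<Machine><MC ACS="1" Adv="0" BrandID="4"/></Machine>'

root = ET.fromstring(xml_text)   # parse from a string; ET.parse() reads a file
mc = root.find("MC")             # first direct child named MC
attrs = dict(mc.attrib)          # attribute names mapped to their string values

print(root.tag)
print(sorted(attrs.items()))
print(mc.get("BrandID"))
```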

CMD commands, Python related

Below is a list of cmd commands related to Python.

Purpose | Command | Additional Information
Installing pip (Python package installer) | python get-pip.py | Run from cmd, in the directory where get-pip.py is saved
Installing pip (Python package installer) | pip install |
Installing pip (Python package installer) | where pip | If "pip install" results in an error, check whether the pip command refers to another program, e.g., Strawberry Perl (if so, uninstall Strawberry Perl)
Checking which Python the .py files are associated with (if two versions of Python were accidentally installed) | assoc .py | Probable output: .py=Python.File
Checking which Python the .py files are associated with | ftype Python.File | Probable output: Python.File="C:\…\python.exe" "%1" %*
Installing into a specific target (e.g., another directory inside a specific package) | pip install --target="/Applications/FreeCAD.app/Contents/lib/python2.6/site-packages" openpyxl | *Not yet successful when tried in FreeCAD
Installing PySide | pip install pyside |
Converting .ui (PySide) into .py, using pyside-uic (converter provided by PySide) | pyside-uic NAMEOFUIFILE -o NAMEOFPYFILE -x | pyside-uic is located in the folder Lib\site-packages\PySide of the Python installation. Switch -x: use it if you want to run the resulting .py by double-clicking it
Help | python --help | Lists the options/switches/parameters
Python version | python --version |
Python version | python -V |
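
When two Python versions are installed, it can also help to ask the interpreter itself which one is running. This is a small sketch using only the standard sys module; the helper name python_info is made up here.

```python
import sys

def python_info():
    """Return (path of the running interpreter, 'major.minor.micro' version string)."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    return sys.executable, version

exe, version = python_info()
print(exe)
print(version)
```

Running this with a double-click and again from cmd shows whether both routes use the same python.exe.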

XML to tab

Print as transpose if data is too wide to view from left to right.

Python, regex experiment.

# -*- coding: utf-8 -*-
from __future__ import print_function
import sys
import re
# Usage: xml2Tab_t.py drill4Opseq.xml 20 ProductData Machine MachiningData SequenceOperation > hasil.txt

# note:
# NOf = NumberOf
# sys.argv: [0] script name, [1] XML file, [2] max number of columns, [3...] wanted tags

minNOfArgv = 3
maxNOfColumn = int(sys.argv[minNOfArgv-1])

# --------------------------------

def printAs(type, line):
    # "title": text between a whitespace and "="; "data": text between double quotes
    if type=="title":
        doPrint(r"\s", "=", line)
    elif type=="data":
        doPrint("\"", "\"",  line)

def storeAs(dType, tType, line, array):
    if dType=="title":
        line = re.sub(r"^\s+", "", line)
        beg = r"\s"
        en = "="        
    elif dType=="data":
        beg = "\""
        en = "\""

    reg = beg + r"(.*?)" + en
    i=0
    j=0

    if tType!="transpose" and tType!="normal":
        nResult = len(re.findall(reg, line, re.S))
        if nResult < maxNOfColumn:
            if nResult!=0:
                tType="normal"            
        else:
            tType="transpose"          
                
    for match in re.finditer(reg, line, re.S):        
        if tType=="transpose":            
            if dType=="title":
                array.append([]) #
                array[i].append(match.group(1))
                i=i+1
            elif dType=="data":
                array[i].append(match.group(1))
                i=i+1
        elif tType=="normal":
            if dType=="title":
                if len(array)==0:
                    array.append([]) #
                array[0].append(match.group(1))
            elif dType=="data":
                if i==0:
                    array.append([])
                    i=len(array)
                array[i-1].append(match.group(1))            

def doPrint(beg, en, line):
    reg = beg + r"(.*?)" + en

    # remove whitespace at the beginning of the line, esp. for title
    line = re.sub(r"^\s+", "", line)

    if re.search(reg, line):  # so that the newline can be printed outside the for loop
        # print each substring between beg and en
        for match in re.finditer(reg, line, re.S):
            print(match.group(1), end="")
            print("\t", end="")
        print("")

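def printArray(type, array):
    # print each stored row tab-separated; "type" is currently unused
    # range (instead of xrange) works on both Python 2 and Python 3
    for i in range(0, len(array)):
        for j in range(0, len(array[i])):
            print(str(array[i][j]), end="")
            print("\t", end="")
        print("")
    print("")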

def main():
    iLine = 0              # index line
    cArg = 0               # current (observed) argument
    insideTag = False
    aData = []

    if len(sys.argv) < minNOfArgv + 1:
        # note: sys.argv[0] is the script name
        return
    # open the file only after the argument check has passed
    fo = open(sys.argv[1])
    for sLine in fo:

        # check boundary (e.g., opening or closing tag)
        for a in range(0, len(sys.argv) - minNOfArgv):
            reg = r"\<" + sys.argv[a + minNOfArgv]
            if re.search(reg, sLine):
                insideTag = True
                cArg = a
                iLine=1                
            reg = r"\<\/" + sys.argv[a + minNOfArgv]
            if re.search(reg, sLine):
                insideTag = False
                if sys.argv[cArg + minNOfArgv]=="ProductData":
                    print("")
                else:
                    print("<" + sys.argv[cArg + minNOfArgv]+">")
                    printArray("", aData)

        "get data"
        if insideTag:
            if sys.argv[cArg + minNOfArgv]=="ProductData":
                if iLine==2:
                    print("<"+sys.argv[cArg + minNOfArgv]+">")
                    printAs("title", sLine)
                    printAs("data", sLine)
                    print("")
                elif iLine==3:
                    printAs("title", sLine)
                    printAs("data", sLine)
                else:
                    printAs("data", sLine)
            else:
                if iLine==2:
                    aData[:]=[]
                    storeAs("title", "", sLine, aData)
                    storeAs("data", "", sLine, aData)
                else:
                    storeAs("data", "", sLine, aData)
                    
            iLine = iLine + 1    
    fo.close()    

if __name__ == "__main__":
    main()
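
The transpose decision in the script above can be sketched in isolation. This is a simplified stand-in, not the script's own logic line by line: transpose and format_table are names invented here, and the rule "transpose when the first row has at least max_columns cells" mirrors the nResult < maxNOfColumn test only loosely.

```python
def transpose(rows):
    """Turn a list of equal-length rows into a list of columns."""
    return [list(column) for column in zip(*rows)]

def format_table(rows, max_columns):
    """Tab-join the rows, transposing first when the table is too wide."""
    if rows and len(rows[0]) >= max_columns:
        rows = transpose(rows)
    return "\n".join("\t".join(str(cell) for cell in row) for row in rows)

table = [["ACS", "Adv", "BrandID"], ["1", "0", "4"]]
print(format_table(table, max_columns=3))    # wide: one title/value pair per line
print(format_table(table, max_columns=20))   # narrow: titles on top, values below
```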

Posting code to wordpress

Python, regex

  1. Add <pre> at the beginning of the code
  2. Add </pre> at the very end of the code
  3. If indentation is needed, use &nbsp;
    In this case, every 4 spaces are converted into 4 &nbsp;
    (because a double tab doesn't seem to work)

.bat (double click)

main.py wantedCode > result.txt

main.py

# -*- coding: utf-8 -*-
from __future__ import print_function
import sys
import re

def main():
    i = 0
    fh = open(sys.argv[1])
    for line in fh:
        if i==0:
            print("<pre>", end="")
        # replace every run of 4 whitespace characters with 4 &nbsp;
        line = re.sub(r"\s\s\s\s", "&nbsp;&nbsp;&nbsp;&nbsp;", line)
        #line = re.sub("    ", "\t", line)
        print(line, end="")
        i=i+1

    print("</pre>", end="")
    fh.close()

if __name__ == "__main__":
    main()
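
A variation on the same idea, as a sketch only: convert just the leading spaces of each line, so that runs of spaces in the middle of a line are left alone. The function name to_pre_block is invented for this note, and restricting the conversion to leading spaces is a deliberate difference from main.py.

```python
import re

def to_pre_block(source):
    """Wrap source text in <pre> tags and turn leading spaces into &nbsp;."""
    def indent_to_nbsp(match):
        # One &nbsp; per leading space on the line.
        return "&nbsp;" * len(match.group(0))

    # With re.M, ^ matches at the start of every line, not only the first.
    body = re.sub(r"^ +", indent_to_nbsp, source, flags=re.M)
    return "<pre>" + body + "</pre>"

print(to_pre_block("def f():\n    return 1\n"))
```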

Additional:

HTML (typed in the editor) | Seen (in the post)
print("&lt;pre&gt;", end="") | print("<pre>", end="")
print("&lt;/pre&gt;", end="") | print("</pre>", end="")
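
Typing the entities by hand gets tedious. On Python 3 the standard html module can do the conversion, as in this small sketch; it is not part of main.py, and quote=False is chosen here so that the double quotes stay readable.

```python
import html

code_line = 'print("<pre>", end="")'

# Escapes &, < and >; quote=False leaves " and ' untouched.
escaped = html.escape(code_line, quote=False)
print(escaped)

# html.unescape reverses the conversion.
print(html.unescape(escaped) == code_line)
```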

Print inside tag XML

Regex, python

1. interface .bat (double click), e.g., execute.bat

main.py drill4Opseq.xml tag1 tag2 > hasil.txt

2. python script, e.g., main.py

# -*- coding: utf-8 -*-
from __future__ import print_function
import sys
import re

def printData(beg, en, line):    
    reg = beg + r"(.*?)" + en
        
    "remove whitespace @beg.of line, esp. for title"
    line = re.sub(r"^\s+", "", line)

    if re.search(reg, line):    
        "print substring between beg and end"
        for match in re.finditer(reg, line, re.S):
            print(match.group(1), end="")
            print("\t", end="")
        print("")

def main():
    i = 0
    arg = 0
    insideTag = False
    #app = QtGui.QApplication(sys.argv)
    if len(sys.argv)")
                printData("\s", "=", line)
            printData("\"", "\"",  line)
            i=i+1
    fh.close()    

if __name__ == "__main__":
    main()
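
Part of main() above was lost when the page swallowed everything between a "<" and a ">". The following is not a recovery of that code, only a sketch of how the same idea (print the quoted values of lines that sit inside a wanted tag) could be written. The function name extract_inside_tags, the sample lines, and the choice to return rows instead of printing them are all assumptions.

```python
import re

def extract_inside_tags(lines, tags):
    """Collect tab-joined quoted values from lines located inside any wanted tag."""
    inside = False
    rows = []
    for line in lines:
        for tag in tags:
            name = re.escape(tag)
            if re.search(r"</" + name, line):
                inside = False          # closing tag: stop collecting
            elif re.search(r"<" + name, line):
                inside = True           # opening tag: start collecting
        if inside:
            values = re.findall(r'"(.*?)"', line)
            if values:
                rows.append("\t".join(values))
    return rows

# Hypothetical sample input.
sample = [
    '<Machine>',
    '  <MC ACS="1" Adv="0"/>',
    '</Machine>',
    '<Other>',
    '  <X A="9"/>',
    '</Other>',
]
print(extract_inside_tags(sample, ["Machine"]))
```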

PS: use &nbsp; for indentation 😀