5 Jun 2017

Excel Spreadsheet Format

MS's flagship office suits are widely used everywhere, even by the calculation-extensive engineering world. There are always someone that like Excel much more than the others. From a software engineer's point of view, it might be more straightforward to extract information directly from the file itself. MS provides the COM objects. The .net platform also enables a more generic but user-friendly way of manipulating office documents including docx and xlsx files. If you'd like to stay ways from COM or .Net platform, you'd better to find another way of doing it. This article illustrates how. I will focus on excel file rather than docx. But the concept should be similar, unless you show me it's not the case :-)
Genera tally, the xlsx file could be considered as a zip - yes, it's a zipped package. By using 7zip to unzip the file to a new folder. The structures are very clear. You will have 3 folders (_rels, docProperties, xl) and one xml ([Content_Types].xml). All secrets are inside those folder. Let's find them step by step.

Goto 'xl' folder, there is a file called 'workbook.xml', where all the relationships between sheet names and r:id (resource id maybe?) was defined under <sheets> tag. By parsing this xml file, you'll have the 'sheet name -> r:id' map. For example:

From this step, all works are done inside 'xl' folder. The sheets information is saved in '_resl\workbook.xml.rels'. The tag 'Relationship' saves the 'Id' and 'Target' for each sheet you have in your original excel file. Write down the value of 'Id' - you need this later. By parsing this xml file, you'll have the 'r:id -> target' map. With the information you have from Step 1, you now have a linked relationship as 'sheet name -> r:id -> target'. One example is shown below:

Take the example shown in Step 2. The real file content was saved in 'worksheets/sheet1.xml' under 'sheetData' tag in the format of row -> c(olumn) -> v(alue). The task then would be parsing this simple xml file.

If there's named cells, it is more complicated now to track down the 'cell name -> sheet:position' relationship from the 'xl\workbook.xml' under 'definedName' tag. Then, the value could be retrieved via the same method explained in Step 1-3 as long as the method of tracking down the 'sheet name -> cell value' has been found.

No comments :

Post a Comment