Hacking Tutorials: Reading Ms Word's .docx File Format

We have all met the .docx file format that Microsoft recently introduced in its word processor, just as we have met the .xlsx file format for spreadsheets (which I covered in a previous tutorial). Today's goal is to understand what the .docx format is about and, of course, to build ourselves a reader for it in C#. Are you ready?

What is the .docx file format?

Nothing more than an Office Open XML format, which, as its name tells us, is a set of XML files. One of those XML files contains the actual document text, and the others are support files for decoration, formatting and culture support (templates, formats, tables, configuration, etc.).
Click here for a wiki about docx file format

To read the document we are going to use the following:

. The ICSharpCode.SharpZipLib library
. The System.Xml namespace and its XML handling functions
. A sample .docx file that you will find in the attachment

So, before we start, download the SharpZipLib library and take a look at the .NET System.Xml namespace and its methods, in case this is your first encounter with it.

In the attachment you will find the file changes.docx, which is a proper docx file from Silverlight. If you open it with WinRAR, you will find out that there are many files inside it.

Posted Image

As you can see, there is one XML file for the document itself, where the text lives, and then there are XML files for the fonts, settings, styles, etc.
We are going to focus on document.xml only, in order to extract just the document's text.
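
If you do not have WinRAR at hand, you can also list the entries from code. The following is only a minimal sketch: it uses the built-in System.IO.Compression classes instead of SharpZipLib so that it runs without any extra reference, and it builds a tiny stand-in archive in memory instead of opening changes.docx. The class name, method names and the three entry names are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

static class DocxEntryLister
{
    // Returns the names of all entries stored in a zip stream.
    public static List<string> ListEntries(Stream zipStream)
    {
        var names = new List<string>();
        using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                names.Add(entry.FullName);
            }
        }
        return names;
    }

    // Builds a small stand-in archive in memory.
    // Assumption: the entry names mimic a docx package, the content is a placeholder.
    public static MemoryStream BuildSamplePackage()
    {
        var stream = new MemoryStream();
        using (var archive = new ZipArchive(stream, ZipArchiveMode.Create, leaveOpen: true))
        {
            string[] entryNames = { "[Content_Types].xml", "word/document.xml", "word/styles.xml" };
            foreach (string name in entryNames)
            {
                using (var writer = new StreamWriter(archive.CreateEntry(name).Open()))
                {
                    writer.Write("<placeholder/>");
                }
            }
        }
        stream.Position = 0; // rewind so the archive can be read back
        return stream;
    }
}
```

To try it on a real file, pass File.OpenRead on your own .docx to ListEntries and print each returned name.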

So basically, we have to unzip the file, find document.xml and parse it. Let's do it.
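
Before the full reader, here is the parsing step on its own, as a rough sketch. It assumes a hand-written and heavily simplified document.xml (the Sample string below is not taken from changes.docx), and the class and method names are made up for illustration. It only collects the w:t text of paragraphs placed directly under the body, so it does less than the complete reader.

```csharp
using System;
using System.Text;
using System.Xml;

static class DocumentXmlSketch
{
    const string WordNs = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written, simplified stand-in for a real document.xml (assumption).
    public const string Sample =
        "<w:document xmlns:w=\"" + WordNs + "\">" +
        "<w:body>" +
        "<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t xml:space=\"preserve\"> world</w:t></w:r></w:p>" +
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>" +
        "</w:body></w:document>";

    // Collects the text of every w:t element, one output line per w:p paragraph.
    public static string ExtractText(string documentXml)
    {
        var xmlDoc = new XmlDocument();
        xmlDoc.PreserveWhitespace = true;
        xmlDoc.LoadXml(documentXml);

        // The "w" prefix must be registered before it can be used in XPath.
        var xnm = new XmlNamespaceManager(xmlDoc.NameTable);
        xnm.AddNamespace("w", WordNs);

        var result = new StringBuilder();
        foreach (XmlNode paragraph in xmlDoc.SelectNodes("/w:document/w:body/w:p", xnm))
        {
            foreach (XmlNode text in paragraph.SelectNodes(".//w:t", xnm))
            {
                result.Append(text.InnerText);
            }
            result.Append("\n");
        }
        return result.ToString();
    }
}
```

Calling DocumentXmlSketch.ExtractText(DocumentXmlSketch.Sample) returns the two paragraphs, each followed by a line break.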

If you are going to do this yourself, here is what you should do:

. Create a Windows Forms application
. In the form, place a button that opens a file-open dialog, which you will use to choose the .docx file to be read
. Add to your project a reference to the previously downloaded ICSharpCode.SharpZipLib.dll
. Add a new class for the DocxTextReader, and paste the following code into it:

using System;
using System.IO;
using System.Text;
using System.Xml;

using ICSharpCode.SharpZipLib.Zip;

namespace tut_reading_docx
{
        class DocxTextReader
        {               
                private string file = "";
                private string location = "";
                
                // constructor, with the fileName you want to extract the text from
                public DocxTextReader(string theFile)   {               file = theFile;   }
 
                // Here the do it all method, call it after the constructor
                // it will try to find and parse document.xml from the zipped file
                // and return the docx's text in a string
                public string getDocumentText()
                {
                        if (string.IsNullOrEmpty(file))
                        {
                                throw new Exception("No Input file");
                        }
                
                        location = getDocumentXmlFile_FromZipFile();

                        if (string.IsNullOrEmpty(location))
                        {
                                throw new Exception("Invalid Docx");
                        }

                        return ReadDocumentText();
                }

                // we go to the xml file location
                // load it
                // and return the extracted text
                private string ReadDocumentText()
                {
                        StringBuilder result = new StringBuilder();

                        string bodyXPath = "/w:document/w:body";

                        ZipFile zipped = new ZipFile(file);
                        foreach (ZipEntry entry in zipped)
                        {
                                if (string.Compare(entry.Name, location, true) == 0)
                                {
                                        XmlDocument xmlDoc = new XmlDocument();
                                        xmlDoc.PreserveWhitespace = true;
                                        xmlDoc.Load(zipped.GetInputStream(entry));
                                        
                                        XmlNamespaceManager xnm = new XmlNamespaceManager(xmlDoc.NameTable);
                                        xnm.AddNamespace("w", @"http://schemas.openxmlformats.org/wordprocessingml/2006/main");

                                        XmlNode node = xmlDoc.DocumentElement.SelectSingleNode(bodyXPath, xnm);

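                                        // only read the body if it was found; either way we fall through
                                        // to zipped.Close() below instead of returning with the zip still open
                                        if (node != null) { result.Append(ReadNode(node)); }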
                                        break;
                                }
                        }
                        zipped.Close();

                        return result.ToString();
                }

                // Xml node reader helper :D
                private string ReadNode(XmlNode node)
                {
                        // not a good node ?
                        if (node == null || node.NodeType != XmlNodeType.Element) { return ""; }

                        StringBuilder result = new StringBuilder();
                        foreach (XmlNode child in node.ChildNodes)
                        {
                                // not an element node ?
                                if (child.NodeType != XmlNodeType.Element) { continue; }

                                // let's get the text, or replace the tags with the actual characters they stand for
                                switch (child.LocalName)
                                {
                                        case "tab": result.Append("\t"); break;
                                        case "p": result.Append(ReadNode(child)); result.Append("\r\n\r\n"); break;
                                        case "cr":
                                        case "br": result.Append("\r\n"); break;

                                        case "t": // it's text!
                                                result.Append(child.InnerText.TrimEnd());
                                                string space = ((XmlElement)child).GetAttribute("xml:space");
                                                if (!string.IsNullOrEmpty(space) && space == "preserve") { result.Append(' '); }
                                                break;

                                        default:  result.Append(ReadNode(child));   break;
                                }
                        }

                        return result.ToString();
                }

                // lets open the zip file and look up for the
                // document.xml file
                // and save its zip location into the location variable
                private string getDocumentXmlFile_FromZipFile()
                {
                        // ICsharpCode helps here to open the zipped file
                        ZipFile zip = new ZipFile(file);

                        // let's look through the entries inside the zip file
                        // until we reach [Content_Types].xml
                        foreach (ZipEntry entry in zip)
                        {

                                if (string.Compare(entry.Name, "[Content_Types].xml", true) == 0)
                                {
                                        Stream contentTypes = zip.GetInputStream(entry);

                                        XmlDocument xmlDoc = new XmlDocument();
                                        xmlDoc.PreserveWhitespace = true;
                                        xmlDoc.Load(contentTypes);

                                        contentTypes.Close();

                                        // we need a XmlNamespaceManager for resolving namespaces
                                        XmlNamespaceManager xnm = new XmlNamespaceManager(xmlDoc.NameTable);
                                        xnm.AddNamespace("t", @"http://schemas.openxmlformats.org/package/2006/content-types");

                                        // lets find the location of document.xml
                                        XmlNode node = xmlDoc.DocumentElement.SelectSingleNode("/t:Types/t:Override[@ContentType='application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml']", xnm);

                                        if (node != null)
                                        {
                                                string partName = ((XmlElement)node).GetAttribute("PartName");
                                                zip.Close(); // close the zip before returning
                                                return partName.TrimStart(new char[] { '/' });
                                        }
                                        break;
                                }
                        }

                        // close the zip
                        zip.Close();

                        return null;
                }

        }
}
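
To see in isolation what getDocumentXmlFile_FromZipFile looks for, the sketch below runs the same XPath lookup against a hand-written [Content_Types].xml. The Sample string is an assumption about what a typical file contains, trimmed down to two elements, and the class and method names are hypothetical.

```csharp
using System;
using System.Xml;

static class ContentTypesSketch
{
    const string ContentTypesNs = "http://schemas.openxmlformats.org/package/2006/content-types";
    const string MainDocumentType =
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml";

    // Hand-written, trimmed-down stand-in for [Content_Types].xml (assumption).
    public const string Sample =
        "<Types xmlns=\"" + ContentTypesNs + "\">" +
        "<Default Extension=\"xml\" ContentType=\"application/xml\"/>" +
        "<Override PartName=\"/word/document.xml\" ContentType=\"" + MainDocumentType + "\"/>" +
        "</Types>";

    // Returns the zip entry name of the main document part,
    // or null when no matching Override element is declared.
    public static string FindMainDocumentPart(string contentTypesXml)
    {
        var xmlDoc = new XmlDocument();
        xmlDoc.LoadXml(contentTypesXml);

        var xnm = new XmlNamespaceManager(xmlDoc.NameTable);
        xnm.AddNamespace("t", ContentTypesNs);

        // Single quotes inside the XPath avoid clashing with the C# string quotes.
        XmlNode node = xmlDoc.SelectSingleNode(
            "/t:Types/t:Override[@ContentType='" + MainDocumentType + "']", xnm);
        if (node == null) { return null; }

        // PartName starts with '/', while zip entry names do not.
        return ((XmlElement)node).GetAttribute("PartName").TrimStart('/');
    }
}
```

Looking the part name up this way, rather than hard-coding word/document.xml, keeps the reader working if a package stores its main document under a different name.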

Run the application, open a .docx file, and you will get something like this:

Posted Image

Just in case, this is how you call the reader helper:


// Create a docxReader object
DocxTextReader docxReader = new DocxTextReader(file);
// and load the extracted text into your favorite textbox (multiline mode, of course)
tbDocxText.Text =  docxReader.getDocumentText();

So today we learned what the .docx and Office Open XML file format is all about, we got introduced to the ICSharpCode libraries, which are very helpful for handling zipped files, and we learned how to find our good old Word text content inside all that zipped XML. Not bad, I would say.
