Why can’t we reverse engineer .doc? A semi-pointless thread, but there are some interesting messages (not to mention flame wars) in there. jetson123 says:

I think it’s an expedient combination: using object serialization for I/O makes it easy for Microsoft to read and write data, makes it difficult for competitors to do anything with the format on other platforms, and forces users to upgrade their copies of Office with every new release.

This is, in fact, at the heart of what people complain about with Microsoft: Microsoft adopts strategies that give them a quick time-to-market, lock users into upgrade paths, and are also effectively exclusionary. I wouldn’t necessarily call that deliberately “evil”. I’m sure many people at Microsoft view it as the natural way of doing software development, and they view everybody else in the industry who bothers with standardized or well-documented formats as people who foolishly waste time and money.

DOC isn’t going to be very important in a few years anyway; Microsoft are moving to XML-based everything. Serialization of COM services will also be XML-based rather than binary-based as it is today.

While it may help a little, serializing objects in XML format will not necessarily result in formats that are significantly more readable, accessible, or backwards compatible. To make sense of a big and complex XML model, you still need a formal definition of what it is.
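
To illustrate the point about XML, here is a minimal sketch in Python. The XML dump, its element names (obj, f0, f1, f2), and the SPEC table are all invented for this example and do not come from any real Office format. The dump parses without trouble, but the field names say nothing until a separate definition assigns them a meaning.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML produced by dumping an object's fields as-is.
# None of these names come from a real Microsoft format.
DUMP = '<obj cls="CPara" v="7"><f0>12</f0><f1>1</f1><f2>00FF00</f2></obj>'

# Parsing succeeds: XML makes the data easy to tokenize...
root = ET.fromstring(DUMP)
fields = {child.tag: child.text for child in root}
# ...but at this point we only know {"f0": "12", "f1": "1", "f2": "00FF00"}.

# A (hypothetical) formal definition is what gives each field a meaning
# and a type. Without it, a reader has to guess.
SPEC = {
    "f0": ("font_size_pt", int),
    "f1": ("bold", lambda s: s == "1"),
    "f2": ("color_rgb_hex", str),
}

def interpret(raw_fields, spec):
    """Map opaque tags to named, typed values using the definition."""
    return {name: convert(raw_fields[tag]) for tag, (name, convert) in spec.items()}

style = interpret(fields, SPEC)
```

The parse step is the easy part in either a binary or an XML format; the SPEC table stands in for the documentation that actually makes the data usable by someone else.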

This is really an issue for users and customers: users should insist that their data is in well-documented formats that remain constant and compatible across releases. That’s why many government offices have insisted on using SGML in the past.

Using serialization for document storage is simply poor engineering, whether it is done by Sun or by Microsoft or by anybody else. Skipping the step of formally defining a storage format is expedient to the company but harmful to users. In the long run, users have too much invested in their content to store it in such an ephemeral format.
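
A rough sketch of the difference, in Python, under invented assumptions: the record layouts, the STYL magic bytes, and all function names below are hypothetical and are not taken from the .doc format. The first half writes fields in whatever layout the current program happens to use; the second half writes a small defined format with a magic number and an explicit version.

```python
import struct

# Hypothetical "version 1" in-memory layout of a style record:
# font size (unsigned 16-bit), bold flag (unsigned 8-bit).
V1_LAYOUT = "<HB"
# Hypothetical "version 2" layout: a 32-bit color was added in the middle.
V2_LAYOUT = "<HIB"

def dump_v1(font_size, bold):
    """Serialize by dumping the fields exactly as the v1 program lays them out."""
    return struct.pack(V1_LAYOUT, font_size, bold)

def load_as_v2(blob):
    """A v2 program reading those bytes using its own layout."""
    return struct.unpack(V2_LAYOUT, blob)

# A defined format: magic bytes, a version byte, then fields the spec names.
MAGIC = b"STYL"

def write_defined(font_size, bold):
    """Write a version 1 record in the defined format."""
    return MAGIC + struct.pack("<B", 1) + struct.pack("<HB", font_size, bold)

def read_defined(blob):
    """Read a record, dispatching on the version the file declares."""
    if blob[:4] != MAGIC:
        raise ValueError("not a style record")
    version = blob[4]
    if version == 1:
        font_size, bold = struct.unpack_from("<HB", blob, 5)
        # Version 1 had no color; the reader supplies a documented default.
        return {"font_size": font_size, "bold": bool(bold), "color": 0}
    raise ValueError(f"unsupported version {version}")
```

In this sketch, load_as_v2(dump_v1(12, 1)) raises struct.error because the 3-byte v1 dump does not match the 7-byte v2 layout, while read_defined(write_defined(12, 1)) keeps working in any later reader that still handles version 1. The version byte and the default for missing fields are the kind of thing a formally defined storage format has to spell out.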