By Bernard Pinon - August 2002. This document describes an implementation of a COBOL 85 parser for JavaCC.
The venerable COBOL programming language is probably one of the most difficult languages for which to write a compiler. Having a strict LL(1) grammar was certainly not on the minds of those who created it; it is even likely that few of them had a clear idea of the benefits of an LL(1) grammar at that time. The language is extremely verbose. There are several input formats, and many variants. Lexical analysis is context dependent. In other words, it is both a challenge and a nightmare.
A COBOL parser can be used as a starting point if you intend to write a COBOL compiler (have fun!) but can also be used to write analysis tools, source formatters, converters between various COBOL dialects, automatic documentation generators, and translators to other languages like C or Java.
The documentation I started from was the Tandem (oops: Compaq (oops: HP today ) ) Non-Stop Cobol 85, the COBOL variant I am most used to. In order to have a comparison - Tandem Cobol is very specific, as it includes extensions to support fault-tolerance and system messaging - I installed the excellent - and free - Fujitsu COBOL, which comes with a very good set of documentation, and started to write a limited grammar and test it against simple COBOL examples.
Then a miracle occurred: I came across a full COBOL grammar developed by Ralf Lämmel & Chris Verhoef: VS COBOL II grammar Version 1.0.3, which is available at http://www.cwi.nl/ and http://www.wins.uva.nl/. This is a very good grammar, and many lines of my parser are simple copy-and-paste of their work, but it was automatically generated, and thus had many imperfections - and one lovely typing mistake: USE WITH DEGUGGING MODE. My grammar is about 80% similar to theirs, but it has been largely modified, mainly in the area of lexical analysis, and also to remove many ambiguities.
The current implementation violates the strict COBOL orthodoxy:
In other words, if you want to parse a real-life COBOL program, you will need to write a preprocessor that will, for each input line:
(NEW) For the lazy ones, you can use my brute force preprocessor.
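As a rough illustration of the kind of per-line work such a preprocessor does, here is a minimal sketch in Java. It is not the brute force preprocessor mentioned above. It assumes the classic 80-column fixed reference format (sequence number in columns 1-6, indicator in column 7, program text in columns 8-72, identification area from column 73 on), and the names FixedFormatLine and normalize are invented for this example. Continuation lines (a - in column 7) and debugging lines (a D in column 7) are not treated specially here, and whether blanking or dropping these areas is right depends on what the grammar expects.

```java
/** Hypothetical helper for illustration; not part of the distributed parser. */
final class FixedFormatLine {

    private FixedFormatLine() {}

    /**
     * Normalizes one line of fixed-format COBOL source.
     * Comment lines and lines too short to hold any program text
     * come back as an empty string, so line numbers stay aligned.
     */
    static String normalize(String line) {
        // Columns 1-6 (sequence area) and column 7 (indicator) must be present.
        if (line.length() < 7) {
            return "";
        }
        char indicator = line.charAt(6);
        // '*' or '/' in column 7 marks a comment line.
        if (indicator == '*' || indicator == '/') {
            return "";
        }
        // Keep columns 8-72 only: index 7 inclusive to index 72 exclusive.
        int end = Math.min(line.length(), 72);
        // Replace the sequence and indicator areas with 7 blanks so that
        // column positions of the program text match the original line.
        String padded = "       " + line.substring(7, end);
        return stripTrailingBlanks(padded);
    }

    private static String stripTrailingBlanks(String s) {
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == ' ') {
            end--;
        }
        return s.substring(0, end);
    }
}
```

Calling FixedFormatLine.normalize on every line of a source file and joining the results with newlines would give a text with the same number of lines as the input, which keeps error positions meaningful.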
This program is distributed under the Free Software Foundation LGPL license, which means that you can do almost whatever you want with it, including selling it or including this work in closed-source, proprietary software, with fair restrictions. For instance, you are requested to publish all changes and fixes made within (not around) this program. You must also give your customers a way to freely access this source code. You cannot distribute it under another license without my consent.
Contributions are welcome: click on my name at the top of this page to send me an email with your contribution. If this project becomes successful, I may install it in newsforge with the usual mailing list, bug tracker, and so on.
Version 0.0 was purely experimental - It parsed only my programs.
Version 0.1 parses the first program of the test suite, but you still should not rely on it for real-life applications.
Note that the program has no main function. You will have to compile it using JavaCC, then compile the generated sources with a regular Java compiler. You will have to write your own mainline, which calls the CompilationUnit() method to start parsing a file. See also the abstract syntax tree page, which gives a mainline example. You can also download the preprocessor source file.
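For readers who have never driven a JavaCC parser, a mainline could be organized along the following lines. This is only a sketch, not the mainline from the abstract syntax tree page. The names CobolDriver and ParseAction are invented for this example, and the call into the generated parser is kept behind a small interface because the name of the generated class depends on what the grammar declares in PARSER_BEGIN.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

/** Hypothetical driver for illustration; not part of the distributed parser. */
final class CobolDriver {

    private CobolDriver() {}

    /** Hypothetical hook that wraps the call into the JavaCC-generated parser. */
    interface ParseAction {
        void parse(Reader source) throws Exception;
    }

    /** Opens the file, runs the parse action on it, and reports success or failure. */
    static boolean parseFile(String path, ParseAction action) {
        try (Reader in = new BufferedReader(new FileReader(path))) {
            action.parse(in);
            return true;
        } catch (IOException e) {
            System.err.println("Cannot read " + path + ": " + e.getMessage());
            return false;
        } catch (Exception e) {
            // A JavaCC-generated parser reports syntax errors as ParseException.
            System.err.println("Parse failed for " + path + ": " + e.getMessage());
            return false;
        }
    }
}
```

Assuming the generated class is called CobolParser (a placeholder name) and has the usual Reader constructor that JavaCC generates, a main method could then call CobolDriver.parseFile(args[0], in -> new CobolParser(in).CompilationUnit()).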
A bit of background about COBOL. The aim of COBOL was to make the source code look like a text written in English. COBOL is the ideal language for those allergic to anything that looks like a mathematical formula.
A COBOL program is divided into divisions, sections, paragraphs, and phrases (with a final dot). Depending on the division, phrases will be composed either of description entries and clauses, or of statements and clauses.
There are four divisions: IDENTIFICATION, ENVIRONMENT, DATA, and PROCEDURE.
In the first three divisions, the sections and the corresponding entries are predefined. In the procedure division, the author can create sections, paragraphs, etc. at will. A section in the procedure division can be seen as a sub-program without parameters, while a paragraph can be seen as a labeled block of statements.
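To make the division structure a little more concrete, the sketch below scans a tiny COBOL source for the four standard division headers (IDENTIFICATION, ENVIRONMENT, DATA, PROCEDURE). It is a plain text scan with a regular expression, not how the JavaCC grammar recognizes divisions, and both the DivisionScanner class and the sample program embedded in it are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the distributed parser. */
final class DivisionScanner {

    private DivisionScanner() {}

    // A division header is the division name followed by the word DIVISION,
    // at the start of a line (leading blanks allowed).
    private static final Pattern HEADER = Pattern.compile(
            "^\\s*(IDENTIFICATION|ENVIRONMENT|DATA|PROCEDURE)\\s+DIVISION\\b",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    /** Invented sample program: three predefined divisions, then user paragraphs. */
    static final String SAMPLE =
              "       IDENTIFICATION DIVISION.\n"
            + "       PROGRAM-ID. HELLO.\n"
            + "       ENVIRONMENT DIVISION.\n"
            + "       DATA DIVISION.\n"
            + "       WORKING-STORAGE SECTION.\n"
            + "       01 GREETING PIC X(12) VALUE \"HELLO WORLD.\".\n"
            + "       PROCEDURE DIVISION.\n"
            + "       MAIN-PARA.\n"
            + "           DISPLAY GREETING.\n"
            + "           STOP RUN.\n";

    /** Returns the division names in the order they appear in the source. */
    static List<String> divisions(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = HEADER.matcher(source);
        while (m.find()) {
            found.add(m.group(1).toUpperCase());
        }
        return found;
    }
}
```

In the sample, MAIN-PARA is a paragraph name chosen by the programmer, whereas WORKING-STORAGE SECTION is one of the predefined sections of the data division.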
Each entry, statement or clause has its own syntax, and sometimes its own lexical rules.
Here is an example of a very simple COBOL program for those who have never seen one. The programs you will come across in the real world are likely to be much more cryptic than this one... Note that the color and font decorations have been added manually.
I have been a freelance Tandem (oops: Compaq Non-Stop (oops: HP-Compaq Non-Stop) ) system programmer since the B40 release of this absolutely fabulous OS - the best I have seen so far, and I have seen a lot. I have also done development on HP-UX (in C, a good system), Solaris (in Java, no comment), various flavors of Linux (in C and PHP, a very good system, just not home-trained yet), and I also have a very bad experience of development under Windows NT4 with ASP/VB/C++ (we threw the code away after three months of struggle and switched to Linux). You can contact me by clicking on my name at the top of this page.