Tokenized Code - QB64 Wiki

Tokenized Code


This document (written by qarnos, formatted by Cyperium) describes the tokenized code produced when a source file is saved by QB 4.5 in binary format. All of the research was done by qarnos. A converter is also available that turns QuickBasic 4.5 encrypted binary BAS files into plain text that can be used by any application (including QB64). You can download it here:


Download QB45BIN.BAS, Binary to Text BAS File Converter


Use the converter to convert QuickBasic Encrypted Fast Load Binary BAS files into text files that can be used by QB64.


Note: DropBox may display the QB45BIN.BAS text in your web browser! Right click to Save As.


!!!IMPORTANT!!!


Now hear this: all the information in this documentation is potentially wrong, and subject to change.


Initially, when I began writing, I was careful to insert phrases like "seems to" and "I think" whenever I could not be at least 99% sure about a particular detail. Unfortunately, the "seems to"s and "I think"s were so numerous that they made for tiresome reading. I have, therefore, decided to reserve these phrases for extreme cases, and caution the reader that "I think" should be mentally prepended to every sentence in the document.


In short: if you treat this document as gospel, you do so at your own risk.


QB45 P-code:

QB45 internally runs its programs using what Microsoft called "Threaded P-code". P-code is an early term for what we would now more frequently call a form of "byte-code". QB45 programs are compiled into an intermediate form, which is then executed by an interpreter (virtual machine).


"Threaded", in this case, refers to the fact that the QB45 p-code for control-flow structures such as "GOTO" and "FOR : NEXT" contains the p-code addresses for the branch points associated with those structures. For instance, the p-code for the "LOOP WHILE" statement contains the address of the first statement of the loop.


The QB45 binary save format is more or less a straight dump of either the p-code itself or, perhaps more likely, an intermediate or syntax tree representation of the source (which will later be turned into p-code) - hence the "quick-load/save". This is supported by the observation that the "threaded" p-code addresses described above, for a new program, will all be set to FFFFh unless the program was RUN prior to saving. It seems QB45 caches these values within the code once they are determined.


In addition to storing the structure of the program, the code also stores some formatting information. If you have ever wondered why the QB45 IDE insisted on you formatting your code in a certain way, the reason is the limitations of what kind of formatting the p-code can handle. The "AS typename" statement, for example, has an indentation offset encoded into it, allowing you to tab it across from the "DIM" (or TYPE usertype) statement.

Overall File Structure:

HEADER:


The header has the following basic format. The sizes of the fields were determined by setting a breakpoint on the DOS file read interrupt, and are assumed to be accurate, even if I have no idea what most of the data is.

2 bytes  - &h00fc
2 bytes  - &h0001
2 bytes  - length of next block (always 12)
12 bytes - unknown data - QB seems to ignore it
6 bytes  - more unknown data, always the same
2 bytes  - more unknown data
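The field list above can be read with a fixed-size unpack. The following is only a sketch: the little-endian byte order is an assumption (QB45 is an x86 DOS program), and the function and key names are made up for illustration.

```python
import struct

# Field sizes come from the list above. Little-endian order is an
# assumption based on QB45 running on x86 DOS.
HEADER_FORMAT = "<HHH12s6s2s"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 2+2+2+12+6+2 = 26 bytes


def parse_header(data):
    """Split the first 26 bytes of a binary BAS file into its raw fields."""
    if len(data) < HEADER_SIZE:
        raise ValueError("file too short to hold a header")
    word1, word2, block_len, block, fixed, tail = struct.unpack_from(
        HEADER_FORMAT, data, 0
    )
    if word1 != 0x00FC:
        raise ValueError("unexpected first word: %#06x" % word1)
    # The key names below are hypothetical; the meaning of most of this
    # data is unknown.
    return {
        "word1": word1,
        "word2": word2,
        "block_len": block_len,
        "block": block,
        "fixed": fixed,
        "tail": tail,
    }
```

Only the first word is checked here, since that is the one value a reader can reject a file on with reasonable confidence.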

SYMBOL TABLE:


The symbol table has a 2-byte length prefix, followed by the data. The actual symbol table doesn't begin until the 82nd byte of the data. The first 82 bytes always seem to be some kind of "index" - it contains a series of 2-byte offsets to various symbols. QB45, however, seems to ignore this. Filling it with random data has no effect.


Interestingly, the word at offset 82 - the end of the index - is always set to 82. This would provide an elegant way to determine where the symbol data begins - since a symbol table offset must always be larger than the index, the list could be terminated with its own byte offset, providing a sanity-check as backup.


The above information is somewhat beside the point. Since all symbols are referenced by their offset into the table, the meaning of the first 82 bytes is largely irrelevant. Of more importance is the format of the symbol table entries:

2 bytes - currently unknown
1 byte  - flags
1 byte  - length of string
x bytes - symbol string

The flags are as follows:

+---+---+---+---+---+---+---+---+
| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
+---+---+---+---+---+---+---+---+
                      |   |
                      |   +-------- Symbol is 16-bit integer
                      +------------ Symbol is a label


If bit 1 of the flags (i.e. flags AND 2) is set, then the symbol is actually a binary 16-bit integer (the length will be set to 2). This is used for small line numbers. Larger line numbers are stored as text strings.
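As a rough illustration of the entry format and the integer flag, the sketch below decodes a single entry. The function name is hypothetical, and three details are assumptions rather than anything established above: little-endian byte order, an unsigned interpretation of the 16-bit value, and code page 437 for the text.

```python
import struct

FLAG_INTEGER = 0x02  # bit 1: symbol is a binary 16-bit integer


def parse_symbol(table, offset):
    """Decode one symbol table entry located at `offset`.

    Returns (value, flags). `value` is an int for small line numbers
    and a str for everything else.
    """
    _unknown, flags, length = struct.unpack_from("<HBB", table, offset)
    start = offset + 4
    raw = table[start:start + length]
    if len(raw) != length:
        raise ValueError("symbol entry runs past the end of the table")
    if flags & FLAG_INTEGER:
        if length != 2:
            raise ValueError("integer symbol should have length 2")
        # Unsigned little-endian is an assumption.
        value = struct.unpack("<H", raw)[0]
    else:
        # cp437 (the DOS code page) is an assumption.
        value = raw.decode("cp437")
    return value, flags
```

Because symbols are referenced by their offset into the table, a decoder can call something like this on demand instead of walking the table from the start.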


MAIN MODULE CODE:


The code consists of a 2-byte length word, followed by the code itself, which is a series of tokens as described below.
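Both the symbol table and the code are prefixed with a 2-byte length, so one helper can read either. This is a minimal sketch with a hypothetical function name, assuming the length word is little-endian and counts only the bytes that follow it.

```python
import struct


def read_block(data, offset):
    """Read a block with a 2-byte length prefix starting at `offset`.

    Returns (block_bytes, offset_just_past_the_block).
    """
    (length,) = struct.unpack_from("<H", data, offset)  # assumed little-endian
    start = offset + 2
    end = start + length
    if end > len(data):
        raise ValueError("block runs past the end of the file")
    return data[start:end], end
```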


TOKENS:


Each token in a QB45 binary file consists of a series of 16-bit words. The first word is always the token ID (in the low 10 bits), which I refer to as the PCODE. The upper six bits I call the HPARAM, for "High PARAMeter" - in honor of Microsoft Windows API "WPARAM" (Word PARAMeter) and "LPARAM" (Long PARAMeter).
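Splitting the first word of a token into its two parts is a matter of masking and shifting. The function name below is hypothetical.

```python
def split_token_word(word):
    """Split a 16-bit token word into (PCODE, HPARAM)."""
    pcode = word & 0x03FF          # low 10 bits: token ID
    hparam = (word >> 10) & 0x3F   # upper 6 bits: high parameter
    return pcode, hparam
```

For example, the word 0C05h splits into PCODE 5 and HPARAM 3.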


The HPARAM is rarely used, even in situations when it could be. For instance, it is used to store the value of small integers (integers <= 10), even though it could store larger values. Strangely, giving QB45 a small integer token with a value larger than 10 causes the IDE to crash. Perhaps the most frequent use of the HPARAM is to store the type suffix (%&!#$) of identifiers.


Most tokens are of a fixed length, although the exact length is token dependent. Most of the time, the length of the extra data is zero. That is - the token consists only of the token id, with no extra data.


A handful of tokens are variable in length. These consist of the token id, followed by a length word. The length is measured from the *end* of the length word, so the minimum size for a variable-length token is 4 bytes (2 byte token id, 2 byte length = 0). An important point to note is that a variable length token is still aligned to 16-bits. So a variable-length token with length = 3 occupies 8 bytes: 2 bytes token id, 2 bytes length, 3 bytes data and 1 byte padding.
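The size rule for variable-length tokens can be sketched as follows. The function name is hypothetical and the length word is assumed to be little-endian.

```python
import struct


def variable_token_size(code, offset):
    """Total bytes occupied by the variable-length token at `offset`."""
    # The length word follows the 2-byte token id.
    (length,) = struct.unpack_from("<H", code, offset + 2)
    padded = length + (length & 1)  # round odd lengths up to a 16-bit boundary
    return 4 + padded               # token id + length word + padded data
```

With a length of 3 this gives 8 bytes, matching the example above, and a length of 0 gives the 4-byte minimum.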


The tokens are most easily processed using a shift-reduce parsing technique, although short-cuts may be taken since most tokens result in a "reduce" action. For a simplified example, consider "PRINT 1 + 2". The QB45 token sequence would be:

<1> <2> <+> <PRINT>

The actions the parser would take on seeing each token:

Token    Action            Stack before action  Stack after action
<1>      SHIFT             empty                <1>
<2>      SHIFT             <1>                  <1> <2>
<+>      REDUCE {1} + {0}  <1> <2>              <1 + 2>
<PRINT>  REDUCE PRINT {0}  <1 + 2>              <PRINT 1 + 2>

The number in {curly braces} refers to the offset of an item from the top of the stack.
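The walk-through above can be reproduced with a few lines of code. This is a toy sketch only: it works on symbolic token names rather than real token words, and the rule table is hypothetical, holding just the two rules needed for the example.

```python
import re

# Hypothetical rule table: token name -> output template. Tokens with no
# rule are shifted onto the stack unchanged.
RULES = {
    "+": "{1} + {0}",
    "PRINT": "PRINT {0}",
}


def detokenize(tokens):
    """Rebuild source text from a sequence of symbolic tokens."""
    stack = []
    for token in tokens:
        template = RULES.get(token)
        if template is None:
            stack.append(token)  # SHIFT
            continue
        # REDUCE: {n} is the item n places below the top of the stack.
        indices = [int(n) for n in re.findall(r"\{(\d+)\}", template)]
        count = max(indices) + 1 if indices else 0
        if count > len(stack):
            raise ValueError("not enough operands for token %r" % token)
        args = [stack.pop() for _ in range(count)]  # args[0] is the old top
        stack.append(template.format(*args))
    return stack
```

Calling detokenize(["1", "2", "+", "PRINT"]) leaves a single item on the stack, "PRINT 1 + 2".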


FOOTER:

After the end of the code block is a 16-byte footer. It consists of eight 16-bit words, with the following meanings:

Word 1: Offset of the first label token data.
Word 2: Offset of the first DEFxxx token data.
Word 3: Offset of the first TYPE declaration.
Word 4: Unknown
Word 5: Total number of lines in code (including $INCLUDEd lines)
Word 6: Number of $INCLUDEd lines
Word 7: Related to the number of variables (possibly including any temporaries needed)
Word 8: Unknown
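Reading the footer amounts to unpacking eight words. In the sketch below the field names are invented labels for the meanings listed above, and little-endian byte order is assumed.

```python
import struct

# Hypothetical field names, in footer order.
FOOTER_FIELDS = (
    "first_label_offset",
    "first_def_offset",
    "first_type_offset",
    "unknown_4",
    "total_lines",
    "included_lines",
    "variable_related",
    "unknown_8",
)


def parse_footer(data, offset):
    """Decode the 16-byte footer that starts at `offset`."""
    words = struct.unpack_from("<8H", data, offset)  # assumed little-endian
    return dict(zip(FOOTER_FIELDS, words))
```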


SUBs/FUNCTIONs


SUBs and FUNCTIONs each have their own code block. The format is identical to the main module format (including the footer), with the addition of a header with the following format:

1 byte  - Unknown
2 bytes - SUB/FUNCTION name length
x bytes - SUB/FUNCTION name (NOT padded to even byte)
3 bytes - Unknown
2 bytes - PCODE size
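Because the name is not padded, the fields after it have to be located by arithmetic rather than with one fixed layout. The sketch below shows one way to do this; the function name is hypothetical, and little-endian byte order and code page 437 are assumptions.

```python
import struct


def parse_procedure_header(data, offset):
    """Decode a SUB/FUNCTION header starting at `offset`.

    Returns (name, pcode_size, offset_of_first_code_byte).
    """
    _unknown, name_len = struct.unpack_from("<BH", data, offset)
    name_start = offset + 3
    pos = name_start + name_len  # the name is NOT padded to an even length
    if pos + 5 > len(data):
        raise ValueError("procedure header runs past the end of the file")
    name = data[name_start:pos].decode("cp437")  # assumed DOS code page
    # Skip the 3 unknown bytes, then read the PCODE size.
    (pcode_size,) = struct.unpack_from("<H", data, pos + 3)
    return name, pcode_size, pos + 5
```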

Interestingly, the first tokens of the SUB/FUNCTION consist of a procedure declaration (of course). QB45, however, takes the name of the procedure from the header, even though it is also provided through a symbol table entry referenced by the token sequence.


The file terminates when no more SUB/FUNCTION blocks are available.


