I. Arena BSA
Welcome to my tutorial series "Figuring out File Formats". I will post at least one lesson per week
(until I run out of ideas) on how to reverse engineer file formats used by games.
In this first lesson we'll take a look at Bethesda Softworks' BSA format as used in The Elder Scrolls: Arena.
Let's start with a pretty simple thing: what do we expect the file to contain?
Since Arena's GLOBAL.BSA is larger than all the remaining game files together, we can expect it to be an archive format of some kind that contains most of the game assets. This also suggests that BSA stands for Bethesda Softworks Archive.
Next: What can we expect to find in an archive file, provided it's not too obscure?
- We'd expect to find the raw data of the actual files. We hope that it's uncompressed.
- A list that tells us where the files are and how long each file is. Length can be implied by position and vice versa.
- A length of the file list, either in number of raw bytes, number of entries, or implicit by a terminating symbol.
So, let's open our BSA file in a hex editor (I'm using HxD). Now we see something like this:
This doesn't look very informative, so let's see if we can find something more interesting. Usually information about the content of a file is stored in its header. If you can't find anything there, chances are good that it's in the footer of the file. So let's check whether the end of the file carries any interesting data.
Obviously, the footer contains a list of file names, and it looks like all entries have the same length. By highlighting one full entry, you can see its length at the bottom of the window ("Länge", German for "length"), which is 12 in hex, or 18 in decimal. To learn more about the footer table, it's very useful to change the display width.
Currently, HxD shows 16 bytes per row (the 16 at the top), so we'll change it to 18 and quickly verify the uniform length of 18 by scrolling through the table.
To see the table properly, it's useful to remove its offset. To do this, find the beginning of the table and remove all bytes above it. In our particular case of Arena's GLOBAL.BSA, the offset is a multiple of 18, which means it doesn't shift the rows. But in general you'll see shifted tables, so find your beginning...
... and remove everything above:
Now we can see that every line consists of two parts: The bytes from 00 to 0D seem to be the filename and the bytes 0E to 11 look like a 4-byte word.
When we scroll through the table, we find that the file name seems to occupy only bytes 00 to 0B instead of 00 to 0D, but we will assume that the name field spans the full range.
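To make the assumed row layout concrete, here is a minimal Python sketch that splits one 18-byte row into its two parts. The little-endian byte order and the zero-byte padding of the name are assumptions at this point, and the sample row is made up, not taken from GLOBAL.BSA.

```python
import struct

# Assumed layout of one footer row: 14 bytes of name followed by a
# 4-byte unsigned integer (little-endian is an assumption here).
ENTRY = struct.Struct("<14sI")

def parse_entry(raw):
    """Split one 18-byte table row into (name, value)."""
    name_bytes, value = ENTRY.unpack(raw)
    # Assumption: unused name bytes are zero, so cut at the first zero byte.
    name = name_bytes.split(b"\x00", 1)[0].decode("ascii", errors="replace")
    return name, value

# Made-up example row for illustration only.
row = b"EXAMPLE.IMG".ljust(14, b"\x00") + bytes([0x34, 0x12, 0x00, 0x00])
print(ENTRY.size)        # 18
print(parse_entry(row))  # ('EXAMPLE.IMG', 4660)
```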
For the 4-byte word, the same applies: it only uses bytes 0E and 0F, but not 10 and 11. We will again just assume that it's actually 4 bytes wide and that higher values simply don't appear because no larger files are stored. The meaning of these 4-byte values can classically be either the length of the corresponding file in bytes, or an offset measured from either the beginning of the raw data or the beginning of the BSA file. Offsets are (almost) always numbers in ascending order. Our numbers are clearly not ordered, so we can conclude that they are sizes.
Next we want to find out how the format knows the length of the footer table. To find out, we select the entire table and read the length that HxD tells us, which happens to be ABA2.
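The "offsets ascend, sizes don't" rule of thumb is easy to test automatically once the values are read into a list. This is only a sketch of the heuristic; the function name and the numbers are made up for illustration.

```python
def looks_like_offsets(values):
    """Return True if the values never decrease, as offsets usually would."""
    return all(a <= b for a, b in zip(values, values[1:]))

# Made-up numbers, not read from GLOBAL.BSA.
print(looks_like_offsets([2, 1500, 1800, 9000]))  # True
print(looks_like_offsets([1498, 300, 7200, 64]))  # False
```

A True result does not prove the values are offsets (sizes could be sorted by chance), but a False result makes offsets very unlikely.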
Since we don't know if the table length is stored in the number of bytes or the number of entries, we'll look for both. To get the number of entries, we divide ABA2(hex) by 12(hex) and get 989(hex).
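If you don't want to do the hex division by hand, a few lines of Python do the job. The variable names are just for illustration; a remainder of zero also confirms that the table is a whole number of 18-byte entries.

```python
table_bytes = 0xABA2  # table length reported by the hex editor
entry_bytes = 0x12    # one entry is 18 bytes

count, remainder = divmod(table_bytes, entry_bytes)
print(hex(count), count, remainder)  # 0x989 2441 0
```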
Now, for some reason, integers can be stored in two different ways: little-endian and big-endian. That means that the number 1122(hex) (reminder: one byte carries two hex digits) can be stored either as "11 22" (big-endian) or as "22 11" (little-endian).
Keeping that in mind, we will look for "AB A2", "A2 AB", "09 89" and "89 09".
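The same search can be scripted. The sketch below builds both byte orders of a value and looks for each in a byte string; the helper name is hypothetical and the sample buffer is made up, using the 1122(hex) example instead of real file data.

```python
def find_candidates(data, value, width=2):
    """Return the first offset of value in data for each byte order (-1 if absent)."""
    result = {}
    for order in ("little", "big"):
        pattern = value.to_bytes(width, order)
        result[order] = data.find(pattern)
    return result

# Made-up buffer that stores 0x1122 in little-endian order at offset 1.
sample = bytes([0x00, 0x22, 0x11, 0x00])
print(find_candidates(sample, 0x1122))  # {'little': 1, 'big': -1}

# The four byte patterns to look for in the BSA file:
for value in (0xABA2, 0x0989):
    print(value.to_bytes(2, "big").hex(" "), "/", value.to_bytes(2, "little").hex(" "))
```

Keep in mind that a 2-byte pattern can show up by coincidence in a large file, so a hit is only a candidate until it has been checked against the rest of the format.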
Undoing the deletion (to restore the removed bytes) and scrolling to the beginning of the file quickly brings results:
Obviously, the first two bytes of our BSA file represent the number of entries in our footer table, storing 0989 as "89 09".
To check if we are right, we quickly write a program that implements the following algorithm:
- Open Global.bsa
- Read 2-byte word NumFiles
- Jump to BSAFile.Size-18*NumFiles
- Prepare two arrays of length NumFiles to store the Names and Sizes
- Iterate through the table and read 14 bytes name and 4 bytes size into the corresponding array slots
- Jump to offset 2 from the beginning of the BSA file
- Iterate through Names, create a file with each name, read a chunk of the corresponding size from the Sizes array and write it into the output file
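The steps above can be sketched in Python as follows. This is one possible implementation under the assumptions made so far (little-endian integers, zero-padded names, raw data stored back to back starting at offset 2). The function names are made up for this example, and it keeps names and sizes in one list of pairs instead of two arrays.

```python
import os
import struct

ENTRY = struct.Struct("<14sI")  # 14-byte name + 4-byte size (assumed little-endian)

def read_table(f):
    """Read the entry count from the header and the name/size table from the footer."""
    f.seek(0, os.SEEK_END)
    file_size = f.tell()
    f.seek(0)
    (num_files,) = struct.unpack("<H", f.read(2))
    f.seek(file_size - ENTRY.size * num_files)
    entries = []
    for _ in range(num_files):
        raw_name, size = ENTRY.unpack(f.read(ENTRY.size))
        # Assumption: unused name bytes are zero.
        name = raw_name.split(b"\x00", 1)[0].decode("ascii", errors="replace")
        entries.append((name, size))
    return entries

def extract_bsa(bsa_path, out_dir):
    """Write every file listed in the table into out_dir and return the table."""
    os.makedirs(out_dir, exist_ok=True)
    with open(bsa_path, "rb") as f:
        entries = read_table(f)
        f.seek(2)  # raw data starts right after the 2-byte entry count
        for name, size in entries:
            # basename() keeps a stray path separator from escaping out_dir
            target = os.path.join(out_dir, os.path.basename(name))
            with open(target, "wb") as out:
                out.write(f.read(size))
    return entries

# Example call (paths are placeholders):
# extract_bsa("GLOBAL.BSA", "extracted")
```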
Finally our result looks like this:
You can now check whether the files were extracted correctly by opening some of the files in third-party formats, such as XMI, and seeing if they work.
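Besides opening the extracted files, a quick arithmetic check can support our picture of the format. If the archive really is just a 2-byte count, the raw data and the 18-byte table entries, those three parts should add up to the file size. This is a sketch under that assumption; the function name and the test numbers are made up.

```python
def layout_is_consistent(file_size, sizes, header_bytes=2, entry_bytes=18):
    """Check that header + raw data + footer table add up to the file size."""
    expected = header_bytes + sum(sizes) + entry_bytes * len(sizes)
    return expected == file_size

# Made-up numbers: two files of 10 and 20 bytes.
print(layout_is_consistent(2 + 10 + 20 + 2 * 18, [10, 20]))  # True
print(layout_is_consistent(100, [10, 20]))                   # False
```

If the check fails on a real archive, that hints at something we haven't accounted for yet, such as padding or extra header fields.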