Tools for Reverse Engineering Binary File Formats

On Quora I asked a question : https://www.quora.com/What-are-some-good-tools-to-help-reverse-engineer-obscure-binary-file-formats

What tools are there to help reverse engineer obscure binary file formats?

So far the answers aren’t particularly informative. Basically hex editors. But I feel there must be something more. Here’s a comment I replied to someone there :

Well, if I try to imagine “something better”, for example, I can think of a program that I could feed a lot of different example files of the same type to, and have it compare across them to identify common sequences or patterns at particular points. (This kind of software exists for analysing databases of DNA, trying to understand the genes, for example)

I see Professional Text/Hex Editor with Binary Templates has templates for describing file formats. But I’d like to have the equivalent of a simple BNF-style grammar notation to describe formats and be given instant feedback, as I’m experimenting with a grammar, whether it fails in matching any of the files of the type I’m trying to reverse engineer.

If I know some information about the kind of data that is being represented, I’d like to be able to give my tool “hints” about it, which it can use to try to deduce what means what.

For example, I’d like to be able to say “this vector image file contains just a red triangle, that one, just a red square and that other one, just a blue square. Given that, what are the likely candidates to represent red, and what are the likely candidates to represent a square?”

Surely there must be stuff like this out there somewhere?

2019 – Time for Python 3

OK, it’s 2019

Finally time to upgrade to Python3 for all my Python projects.

In particular that means ThoughtStorms wiki and Project ThoughtStorms

If you are using either the project or the thoughtstorms library then be prepared that they will all be moving to Python3 in the near future.

Given that Python2 is going to be deprecated next year, and I don’t really have the time and inclination to support both, I’m moving to Python3 this year. And all my projects and code will be, too.

2to3 makes it all pretty easy.

Extra thoughts on "Assemblage" Oriented Programming

Brief followup thoughts on the previous article. Read that first.
Classes?
Classes are really just techniques to help construct objects. In an “Assemblage” language the Assemblage or Pattern itself is the way you construct the objects. The grammar explains how to parse a plain EDN or JSON-like data-literal into the assemblage.
So perhaps we don’t need classes in our Assemblage language.
Inheritance?
Much of the experience of the last four decades is that inheritance is almost more trouble than it’s worth. The “brittle base-class” problem etc. OO practice tends towards re-use through delegating to components, rather than inheritance.
As Charles Scalfani says here, in the real world, containment / ownership relationships are far more common and far easier to reason about and work with than class or type hierarchies.
So why do OO languages emphasize, and provide special infrastructure resources to help with, inheritance hierarchies but not containment hierarchies?
Good question. The + sigil in my suggested Assemblage language is explicit language support for containment. Perhaps there could be more. And we can dispense with inheritance.
Our Assemblage language can use prototypes when something like inheritance is absolutely necessary.
Module boundaries
Defining module boundaries, loosening coupling at boundaries and data-hiding to prevent leakage of dependency through module boundaries are obviously good practices. But making each individual class a hard bounded module is a mistake. One of the pains of Java is that classes which are inevitably and rightly highly interdependent are obliged to be more stand-offish and formal with each other than they need to be.
In Assemblage programming, the Assemblage is the natural module. An entire Assemblage would likely live within a single file. The brief, elegant grammatical description of the assemblage in a single place means that it’s easy to understand the structure of the assemblage and, if necessary, when changing it, to understand that change everywhere in the assemblage’s code.
So it makes no sense to decouple objects within the assemblage from each other. Or to hide the details of the inner structure of one object in the assemblage from others in the same assemblage.
Data hiding / privacy, if supported / enforced at all by the language, should be at the level of the assemblage and not the individual objects.
At the same time, stricter, fine-grained control over mutability is more useful. Explicitly immutable fields within objects make data-hiding less of an issue as it’s impossible for other parts of the assemblage to make unexpected changes to it.

What’s wrong with OO?

Some thinking over at Quora : What’s wrong with OO?

Continuing some thoughts that have been haunting me since I originally phrased them in the answer about designing the perfect programming language : Phil Jones’s answer to How would you design the perfect programming language?.

And in response to Alan Mellor‘s comment on Phil Jones’s answer to In what direction computer science is going for next 10 years? where he says he’d prefer to program a “device swarm” in OOP.

There’s nothing wrong with OO.

Considering the kind of OO you get in Smalltalk (which is really the only OO worth thinking about), it’s a beautifully simple and elegant and powerful paradigm for programming.

The problem is OO languages.

As I put it in my answer :

It’s that mismatch between a program that thinks of itself as a network of interacting objects, and a language that has no explicit concept of “a network of interacting objects” and no convenient way to express it, that makes Java and friends such hard work. Somewhere deep in the heart of the UML is a good intuition, struggling to find a viable implementation.

These days I’m all about languages for declaring / representing data. OO is a fantastic semantics / virtual machine for thinking about programs. And it’s even fine for data. Data is just a network or assemblage of interconnected objects.

BUT …

No OO language has such an explicit concept of network or assemblage. They all stop at the level of class. Ie. the generalized form of a single object type.

But why does this need to be the case?

Why can’t we have a way of declaring a collection of classes / objects in one go?

I started thinking in terms of JSON or EDN-like notations.

But there’s an obvious problem. Object networks are fundamentally networks (or lattices) not hierarchies.

But all our notations for serializing complex data-structures into a linear text file can really only handle hierarchical data. You use indentation or brackets or XML-style tags to indicated sub-levels of the tree.

But we all know that the connections between objects in a network or system break out of hierarchies and demand a “anything can connect to anything” structure. And that becomes awkward.

Either you go for a special “network drawing” tools of the kind that UML goes for. Or you have some arbitrary and ugly way to define the hierarchy breaking connections.

But then a couple of days ago I was watching a talk which briefly covered the ML / Haskell family of languages’ algebraic data-types. Which are great for defining complex data schemas in a way that’s almost like Backus-Naur or grammatical definitions.

And it occurred to me that these would work pretty well for an assemblage of objects too.

So here’s the question … could we add some higher level aggregate command to our OO languages that looks something like this. I’m going to call this “Pattern”. But maybe “Schema” or “Assemblage” would be better.

Here goes :

Pattern make: #CarPattern from:
Car = Engine Chassi Wheel[4] ModelName
Engine = noCylinders horsepower
Chassi = width depth weight
Wheel = diameter
ModelName = string.

This is a pretty standard schema-like definition. And in Haskell you often see the algebraic types defined together so you can quickly read such a schema.

Whereas in pretty much every OO language in practice, there is so much emphasis on individual classes that the wider schema gets lost. You can’t see the forest for the trees. That’s true of Java which forces you to put each class definition in a separate file, but also Smalltalk where the Class Browser puts each class in its place in the hierarchy, usually based on type rather than affinity.

So, in my example, on compilation or execution (depending on the kind of language this is) this Pattern definition would produce a number of classes : Car, Engine, Chassi, Wheels, ModelName, with instance variables based on their terminals.

But these objects also have slots or instance variables for the components. Or in the case of Wheel, there’s a Wheel class, but the Car class has a collection or array of four of these Wheels.

I’ve long thought that it would be useful for a language to have an explicit distinction between slots containing objects that are components “internal” to an object, and slots containing other external objects that are lent to it, or that represent things outside itself.

Here I’ve added a new notation. By default the grammar defines components. But a + sigil at the front of the name makes something a reference to an external object.

Pattern make: #SchoolPattern from:
School = Classroom[] Teacher[] Pupil[] Curriculum
Classroom = id
Teacher = Name employeeNumber
Pupil = Name studentNumber
Curriculum = SchoolClass[]
SchoolClass = +Teacher +Classroom +Pupil[] name hour
Name = firstName lastName.

Note : I’ve used SchoolClass rather than Class to avoid ambiguity with the OO notion of Classes. Here it’s a collection of Students in a Classroom with a particular Teacher at a particular hour.

But our notation is explicit that these objects stored in the SchoolClass slots come from elsewhere in the system. They are not internal to, or limited to a particular SchoolClass. They can appear outside any particular SchoolClass. Perhaps there’s a Teacher who isn’t currently teaching anything this semester.

Now what about or?

Pattern make: #Vehicle from:
Vehicle = Car | Truck | Bus.

In a statically typed language like Haskell, or even a statically typed OO language like Java, this kind of feature would be meaningful. In a dynamically typed Smalltalk or Python maybe not so much but there could be value to it.

I’m in favour of languages that try to eliminate null values.

We could certainly imagine a Smalltalk-like that doesn’t allow nulls in slots by default. But has some kind of Maybe or Optional for when it’s necessary. And this can be a sigil too.

Let’s go with ?

Pattern make: #Student from:
Student = id Name email ?whatsapp
Name = firstName lastName.

Here the whatsapp value is explicitly wrapped in however the language thinks about Maybes.
And we can imagine a language with default immutability but sigils again used for explicit mutability.

We’ll use @ for mutable instance variables.

Pattern make: #PongPattern from:
Pong = Paddle[2] Ball
Paddle = x @y height color
Ball = @x @y @dx @dy color radius.

Immutable instance variables like the x co-ord of the Paddles and the colours of everything on screen are set at construction time. Only the y of the paddle and the x, y, dx and dy of the ball can change at a later time.

So how do we construct and instantiate a pattern?

Well, we already have a basic grammar for our model which can now be pressed into service in serializing / deserializing the data from our collection of objects into some kind of data-structure.

Everything which is internal must eventually bottom out as primitive numbers and strings or simple objects.

So, for example, using a hiccup-like notation :

PongPattern :constructAll
[:Pong [:Paddle 10 250 50 (Color new: 255 100 100)]
[:Paddle 490 250 50 (Color new: 100 100 255)]
[:Ball 250 250 1 -1 (Color new: 100 255 100) 4] ]

Note that parsing data-structures is a place where even in a dynamically typed language the | is useful.

Anything external must be passed in as an object reference.

What about abstract classes and interfaces? In one sense, the algebraic data gives you that for free. In a statically typed language, a Vehicle would be something that could either be a Car or a Truck or a Bus.

What about adding methods to classes?

This would be in a separate section of the schema definition. And work more like C++ or (as I understand as I’ve never used it) CLOS.

Something like :

Pattern make: #PongPattern from:
Pong = Paddle[2] Ball
Paddle = x @y height color @score
Ball = @x @y @dx @dy color radius
methods:
Ball:
testAgainst: player1 and: player2
(self @y < 3) ifTrue: [@dy = 1]. (self @y > 497) ifTrue: [@dy = -1].
(self collideWith: player1) ifTrue: [@dx = 1].
(self collideWith: player2) ifTrue: [@dx = -1].
(self @x < 0) ifTrue: [ player2 incScore. self @x := 250. self @y := 250. ]. (self @x > 500) ifTrue: [ player1 incScore.
self @x := 250.
self @y := 250.
].

etc.

Anyway … that’s the first braindump.

Just to note. This isn’t meant to be changing the semantics of a running OO system. When Patterns are made they just turn into Classes. When Patterns are constructed instantiated we create and instantiate ordinary Objects.

It’s just that this gives us a way of making the concept of network or assemblage of objects explicit within the language. And helps save us from what seems to make OO so wearisome. The fiddling with fragments. The dreaded “Ravioli code”

A Tower of Simple Systems

I posted this on Future Programming at Quora : A Tower of Simple Systems

Re-reading the answer I reposted here a couple of days ago about the “Holy Grail” of programming language design : modification.
https://www.quora.com/q/qqylyzqgpwisoecc/What-is-the-Holy-Grail-of-programming-language-design
So here’s a question.
A dumb solution …
But what if we created a system as a tower of “snapshots” of solutions?
Git and other source control systems already store our development history as a series of snapshots.
Containers already give us the way to snapshot full systems.
Things like Amazon Lambda give us the ability to spin-up functions quickly and cheaply on demand.
So …
… what if, when I build a system, I just make a system that handles the first use-case or user story.
That’s now “frozen” at the base of my tower.
Then a new use-case or user story comes in. Instead of trying to slot handling this use-case into my existing program, I just create a new, trivially simple, solution to handle it. That gets added as the next “floor” of the tower.
Now the system is up and running. A user interaction happens.
The system first tries to see if the current top-floor of the tower can handle the behaviour. If it can, fine. If not then we simply drop down to the layer of the tower below it to see if that can handle the story.
And so on …
Effectively our tower of old versions of the system is acting rather like an immutable database, “story-handlers”. For any user-story we can keep adding new handlers for it to the top of the tower (where it supersedes older versions). Other user stories which haven’t changed, we just fall down the tower until we find a handler for them.
Would this simplify trying to compose complex systems?
Obviously this will work best when each user-story handler is a purely functional program (easy and cheap to spin up) and state is kept in an external database.
What does everyone think? Is this just crazy or a viable way of approaching the problem of growing a system? Ie. embrace the fact that you’re going to get cruft (or layers of archaeology) building up anyway, so just handle that explicitly?
Can we call it “Cruft-oriented programming” 🙂 ?

On Architecture and Modules

Another long Quora Answer
Why is it important to agree on software architecture principles?
In a sense, some of this is an update on my thinking on “modularity” (eg. ThoughtStorms:DecompositionByLanguageIsProbablyAModularityMistake

Well, possibly it’s only important in an “organizational” sense.
In that people in your team or project need to be aligned in their conception of the architectural principles or they’ll be working against each other or fighting.
OTOH, are absolute / global architectural principles important?
I’m inclined to think that it’s the same as in the case of “real” architecture. Of buildings and cities.
In architecture I’m a big fan of Christopher Alexander (inventor of “Pattern Languages”) The point about Alexander’s “Timeless Way of Building” is that it IS universal, but part of the universalism is intense awareness of local conditions. It champions the traditional and “vernacular”. It’s small-c “conservative”, in the Burkean sense. Believing that “what people have been doing a lot of around here for last few hundred years is probably a good idea”
OTOH, taking a bunch of ideas that have become successful in France and dumbly re-applying them in Brazil in the name of some bogus “universality” can be, at best ridiculous, and at worst disastrous.
So there are many heuristics and patterns that are useful, but have to be seen in context. How much abstraction and indirection you might need can depend on the kind of project, the language you are using, the type of user, type of interaction, platform the software is hosted on etc.
I increasingly believe that writing good software is like being a good butcher.
Butchery is about knowing where the natural joints of the animal are and carving along them. This allows you, in the minimal number of cuts with minimal effort, to produce the maximum useful pieces of meat.
The same is true of software. It’s about getting a feel for exactly where the natural module boundaries of your application are. What things need to be decoupled and how much decoupling / indirection is needed at each boundary.
Getting that wrong is what leads to expensive problems. And being “wrong” can mean both too close coupling of things that should be more loosely coupled with extra layers of indirection. And ALSO too much unnecessary indirection / abstraction between things that naturally should be more cohesive.
We tend to teach the importance of putting in abstractions / layers of indirection, but ignore the cost of doing it when we don’t need it.
People sometimes cite Richard Gabriel’s Worse Is Better as a rather vague principle. But read it carefully and a LOT of it is about exactly this problem. How much a function should expose the caller to its own failures.
The intuition we are all taught to cultivate, “Do the Right Thing” is that the module should protect the caller as much as possible. “Worse is better” argues that in this case, the module incurs too high a cost (in complexity) from trying to protect the caller from this failure. And that it’s both “worse” but, in fact, “better” to let the caller deal with the failure.
The important point is here is that the general principle is not “never protect the caller from your failure”. Nor is it “always protect the caller from your failure”. The important lesson is that “better” is to recognise what is the right answer in this particular situation.
If “agreed architectural principles” is taken to mean dumb applications of heuristics : “always use an extra abstraction layer” then they are worse than useless. They’re positively dangerous.
If the “agreed architectural principle” is “look at what people have been doing here for a decade, understand why, and follow that” then we’re on to something.
There are patterns that are nearly essential in Java. But pointless in Python.
Java and C++ are extremely similar languages in many way … BUT if you write Java in C++ or C++ in Java you are doing things very wrong.
They are not substitutes.
Java IS suitable for writing huge systems in a way which C++ just isn’t. If you try to write the kind of mega-application that Java is used for in C++ it’s going to be horrible. Juggling that much memory allocation by hand is intractable.
Use C / C++ for small, independent low-level programs, and glue them together in something else (eg. write small independent tools orchestrated within the operating system, or as libraries called from a Python script).
OTOH, trying to write the small programs for which C / C++ are good in Java is overkill. You are going to put too many abstract boundaries and the cost of garbage collection into something that should be smaller, simpler and faster.
One thing which is particularly egregious about Java (and the C++ heritage it comes from) is that these languages have an impoverished vocabulary for talking about “boundaries” between modules. They see classes and objects as a universal solution for everything.
In fact almost all programming languages have fairly poor vocabularies.
When we think of software in terms of architecture, or in terms of the “natural joints” we start to see many kinds of joints / membranes between the parts of our systems. With many degrees of permeability. Many features of languages are about this : the inlineability of functions, hygiene in macros, scope rules, “referential transparency”, data-hiding, lazy vs. eager evaluation of function arguments, the access rules for classes, synchronous messages to objects vs. asynchronous messages to actors, go-routines, sockets, the Unix pipe, internet protocols, microservices, integration “at the glass”, integration in the database, centralization / decentralization of databases.
All these are issues to do with “what kind of boundary is there between THIS and THAT?” How permeable is the boundary? How much do dependencies leak through? What obligations does it incur? What timing commitments does it need? How is the data communicated represented? How are errors checked and controlled?
We recognise this huge variety. But our languages often don’t. Most languages try to hide the variety behind a single principle : everything is a function. Or everything is passing messages between objects defined with classes. Or everything is an actor with async. messages.
While there is something very attractive about this simplicity and uniformity. Sooner or later you find yourself somewhere where the kinds of boundaries you want between the parts DON’T correspond well to the kinds of boundaries that your simple principle defines. You think actors and immutability are way cooler than mutable objects. And then you try to write a photo-editing program that applies filters to huge bitmaps.
So one architectural issue is that our languages try to enforce a single type of boundary when we want many.
The other is arbitrary and unnecessary boundaries. The biggest culprit is the difference between what is inside the program and what is outside it. Inside the program we have function calls, messages, possibly go-routines and internal queues / async. channels. Outside we have async. pipes from the OS. And synchronous socket communication. And a separation of database engine and cache engine. And search engine. And front-end web-server. And client. And server. And microservices. And XML-RPC and SOAP. And Amazon Lambda and similar “function as a service” etc etc.
Our program is divided into files. And various languages and frameworks insist that they should be divided within the file system in a particular way.
Our systems are arbitrarily divided by the underlying architecture of our platforms. Regardless of the natural joints of our applications.
I believe that one task for the next generation of languages is to be able to describe a range of boundary / membrane types and their permeability, timing and protocol requirements, all within a coherent and simple vocabulary. And where questions like “is this communication synchronous or asynchronous”, “lazy or eager” etc. are explicit “parameters” or alternatives within our language.
The other things these languages must do is transcend the “inside the program” / “outside the program” distinction. Today we have applications which have maybe a couple of lines of code doing calculations and other “business” logic. And a huge external “extro-structure”, often dozens of config files and MVC separated directories, to represent routing, caching, permissioning etc. architecture.
We need programming languages where the large scale architecture is a first class citizen of the program. Not something which has to be laboriously built around it. And when the program compiles, it compiles to not only object code, but to automated architecture : Puppet or Ansible scripts, Continuous Integration, containerization, orchestration of Kubernetes pods. This is where the DevOps revolution needs to end up. Languages that are as fluent in talking about all the parts of our systems, and all the kinds of communication between them.
So why am I talking about this in a question about “principles”?
Because software as an art / discipline / profession / culture advances when we take informal ideas and turn them into formal code that can be executed. Heuristics and “good practice” become patterns, become libraries, and ultimately become language features.
Architectural principles will ultimately be recognised and “agreed upon” parts of our practice when they finally become part of the languages we use everyday.

Future Programming on Quora

Quora has introduced the idea of “spaces”. A kind of “blog” or curated collection of existing answers from different people, to help organize answers around particular themes.
I just created a Future Programming Space to gather my answers about various ideas in the future of programming languages.
I still intend to move my answers to this blog and the ThoughtStorms wiki. But in the meantime, there’ll hopefully be a good collection / conversation over there.