Resources:

Digital Design, 5th Edition, by M. Morris Mano and Michael D. Ciletti.
Computer System Architecture, 3rd Edition, by M. Morris Mano.
Computer Science Lessons on YouTube

First-generation languages

These are the lowest-level languages, generally consisting of ones and zeros or a shorthand form such as hexadecimal. Distinguishing data from instructions is difficult at this level because all the content looks the same.
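As a small illustration of why data and instructions are hard to tell apart, the sketch below prints one byte string three ways. The byte values are an assumption chosen for the example: on x86-64 they happen to decode as push rbp; mov rbp, rsp; ret, but nothing in the bytes themselves says so.

```python
# Five raw bytes. On x86-64 they decode as: push rbp; mov rbp, rsp; ret.
# Nothing in the bytes marks them as code rather than data.
blob = bytes([0x55, 0x48, 0x89, 0xE5, 0xC3])

as_hex = blob.hex(" ")                         # hexadecimal shorthand
as_bits = " ".join(f"{b:08b}" for b in blob)   # ones and zeros
as_number = int.from_bytes(blob, "little")     # same bytes read as one integer

print(as_hex)
print(as_bits)
print(as_number)
```

Whether those bytes are a function prologue or a 40-bit number depends entirely on how the reader chooses to interpret them.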

First-generation languages may also be referred to as machine languages, and in some cases byte code, while machine language programs are often referred to as binaries.

Second-generation languages

Also called assembly languages, second-generation languages are a mere table lookup away from machine language and generally map specific bit patterns, or operation codes (opcodes), to short but memorable character sequences called mnemonics. These mnemonics help programmers remember the instructions with which they are associated.

An assembler is a tool used by programmers to translate their assembly language programs into machine language suitable for execution. In addition to instruction mnemonics, a complete assembly language generally includes directives to the assembler that help dictate the memory layout of code and data in the final binary.
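The "table lookup" idea can be sketched as a toy assembler. This is only an illustration, not how a real assembler is built: it knows four single-byte x86 instructions, takes no operands, and ignores directives. The function name assemble and the ";" comment syntax are assumptions made for the example.

```python
# Mnemonic -> opcode table for a few single-byte x86 instructions.
OPCODES = {
    "nop": 0x90,   # do nothing
    "ret": 0xC3,   # return from procedure
    "int3": 0xCC,  # breakpoint trap
    "hlt": 0xF4,   # halt the processor
}

def assemble(source: str) -> bytes:
    """Translate one mnemonic per line into machine code bytes."""
    out = bytearray()
    for line_no, line in enumerate(source.splitlines(), start=1):
        text = line.split(";", 1)[0].strip().lower()  # drop ';' comments
        if not text:
            continue                                  # skip blank lines
        if text not in OPCODES:
            raise ValueError(f"line {line_no}: unknown mnemonic {text!r}")
        out.append(OPCODES[text])                     # the table lookup
    return bytes(out)

program = """
nop      ; padding
nop
ret
"""
print(assemble(program).hex(" "))
```

Real instruction sets need far more than a dictionary (operands, addressing modes, labels), but the core mapping from mnemonic to bit pattern is the same.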

Third-generation languages

These languages take another step toward the expressive capability of natural languages by introducing keywords and constructs that programmers use as the building blocks for their programs. Third-generation languages are generally platform independent, though programs written using them may be platform dependent as a result of using features unique to a specific operating system. Often-cited examples of third-generation languages include FORTRAN, C, and Java.

Programmers generally use compilers to translate their programs into assembly language or all the way to machine language (or some rough equivalent such as byte code).

When you write code in a high-level language (like Python, C, or Java), it eventually has to run as machine code (binary), which the CPU understands. Along the way, the code goes through several transformations:

Compiler

A C program is human-readable, but CPUs don’t understand high-level code. They only execute machine language, raw binary instructions specific to the CPU architecture.

A compilation system is used to transform the high-level code into machine code in stages, using four main tools:

Stage	Tool	Output
1. Preprocessing	`cpp`	`hello.i` (modified source code)
2. Compilation	`cc1`	`hello.s` (assembly)
3. Assembling	`as`	`hello.o` (object file)
4. Linking	`ld`	`hello` (executable)
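You can usually watch each stage separately by asking the gcc driver to stop early. The sketch below only builds the command lines (it assumes gcc is installed and that hello.c exists if you actually run them). The helper names build_stages and run_stages are made up for this example. The driver runs cpp, cc1, as, and ld for you behind the scenes.

```python
import shutil
import subprocess

def build_stages(stem: str = "hello"):
    """Return one gcc command per stage; each stops after that stage."""
    return [
        ["gcc", "-E", f"{stem}.c", "-o", f"{stem}.i"],  # 1. preprocess only
        ["gcc", "-S", f"{stem}.i", "-o", f"{stem}.s"],  # 2. compile to assembly
        ["gcc", "-c", f"{stem}.s", "-o", f"{stem}.o"],  # 3. assemble to object file
        ["gcc", f"{stem}.o", "-o", stem],               # 4. link into executable
    ]

def run_stages(stem: str = "hello") -> None:
    """Run the stages in order; needs gcc and <stem>.c in the current directory."""
    if shutil.which("gcc") is None:
        raise RuntimeError("gcc not found on PATH")
    for cmd in build_stages(stem):
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)

# Print the commands without running them.
for cmd in build_stages():
    print(" ".join(cmd))
```

Calling run_stages() in a directory containing hello.c should leave hello.i, hello.s, hello.o, and hello behind, one file per row of the table.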

1. Preprocessing

Handles directives like #include, #define, #ifdef.

It copies the content of header files (e.g., stdio.h) into your code. Why?

Because the compiler needs the declarations (like printf()), which are in the header, to generate correct code.
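To make the "copy the header into your code" step concrete, here is a toy preprocessor. It is a sketch under heavy assumptions: headers live in a made-up in-memory dictionary instead of on disk, only #include <name> and simple #define NAME value are handled, and macros are replaced even inside string literals, which a real preprocessor would not do.

```python
import re

# Hypothetical in-memory "header" standing in for a real stdio.h.
HEADERS = {
    "stdio.h": "int printf(const char *format, ...);",
}

def preprocess(source: str, headers=HEADERS) -> str:
    """Expand #include <name> and object-like #define macros (toy version)."""
    macros = {}
    out = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("#include"):
            name = stripped.split("<", 1)[1].split(">", 1)[0]
            out.append(headers[name])        # paste the header text in place
        elif stripped.startswith("#define"):
            _, name, value = stripped.split(None, 2)
            macros[name] = value             # remember the macro, emit nothing
        else:
            for name, value in macros.items():
                # Replace whole-word occurrences of the macro name.
                line = re.sub(rf"\b{re.escape(name)}\b", lambda m: value, line)
            out.append(line)
    return "\n".join(out)

source = """#include <stdio.h>
#define ANSWER 42
int main(void) { printf("%d", ANSWER); return 0; }"""
print(preprocess(source))
```

After this step the compiler sees the printf declaration directly above main, which is what lets it check the call and generate the right code.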

2. Compilation

Translates C code to assembly language (hello.s) — a symbolic version of machine code.

Assembly gives you:

Direct visibility into registers, stack, calling conventions.
A common low-level output language shared by different compilers and source languages, though the instructions themselves are specific to the CPU architecture.

3. Assembling

Assembler converts .s (assembly) into .o — a relocatable object file:

Contains binary machine code, symbol tables, relocation data.
These are not yet complete executables — they don’t know where functions like printf reside yet.
.o files expose function and variable symbols along with their offsets, so they're often used to study compilation artifacts during static analysis.
Tools like objdump, readelf, and nm show you the internals of .o files.

4. Linking

Combines your .o with libraries (printf.o, libc.a) into a complete executable:

Resolves external symbols (e.g., where is printf?)
Creates final machine code with memory addresses assigned (fixed, or relative to a load base for position-independent executables).
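Symbol resolution can be sketched with a toy linker. Everything here is simplified for illustration: objects are plain dictionaries rather than real .o files, the base address 0x401000 is an assumed value, and no relocation patching is done. The error message only imitates the style of a real linker.

```python
def link(objects, base=0x401000):
    """Lay out objects back to back and resolve every symbol reference.

    Each object is a dict with:
      "size":    number of bytes of code it contributes
      "defines": {symbol: offset inside this object}
      "needs":   [symbols it references but may not define]
    """
    addresses = {}
    cursor = base
    for obj in objects:                      # pass 1: assign final addresses
        for name, offset in obj["defines"].items():
            if name in addresses:
                raise ValueError(f"duplicate symbol: {name}")
            addresses[name] = cursor + offset
        cursor += obj["size"]
    for obj in objects:                      # pass 2: check every reference
        for name in obj["needs"]:
            if name not in addresses:
                raise ValueError(f"undefined reference to '{name}'")
    return addresses

hello_o = {"size": 0x20, "defines": {"main": 0x0}, "needs": ["printf"]}
printf_o = {"size": 0x40, "defines": {"printf": 0x0}, "needs": []}

print({name: hex(addr) for name, addr in link([hello_o, printf_o]).items()})
```

Linking hello_o on its own raises the "undefined reference" error, which is the same situation as forgetting to link a library.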

There are two types of linking:

Static linking: copies the library code your program needs (for example, printf.o pulled out of libc.a) into your executable.
Dynamic linking: adds a reference to a shared object (libc.so) that is loaded at run-time.

Why this matters for analysis:

Linker errors are usually caused by unresolved symbols, so they show which parts of a program depend on which.
Dynamic linking can be hijacked using techniques like LD_PRELOAD or DLL injection.
Understanding how symbols are resolved at link time and load time is critical for symbol-resolution exploits, ROP chains, etc.
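The reason LD_PRELOAD works is search order: a preloaded library is consulted before the normal ones, so its definition of a symbol wins. The sketch below models only that ordering. The library contents and the name hook.so are made up for the example, and a real dynamic linker does much more.

```python
def resolve(symbol, search_order):
    """Return the first library in search_order that defines symbol."""
    for library, exports in search_order:
        if symbol in exports:
            return library
    raise LookupError(f"undefined symbol: {symbol}")

libc = ("libc.so", {"printf", "puts", "malloc"})
hook = ("hook.so", {"puts"})             # hypothetical preloaded library

print(resolve("puts", [libc]))           # normal run
print(resolve("puts", [hook, libc]))     # preloaded copy is found first
print(resolve("printf", [hook, libc]))   # not hooked, falls through to libc
```

Only the symbols the preloaded library defines are affected. Everything else still resolves to the original library.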

Assembler

Converts assembly code into machine code (binary).

Linker (optional but important)

Combines compiled code from multiple files or libraries. Resolves function calls and memory addresses.

Disassembler

Converts machine code to assembly language. This is not perfect, because code and data can be hard to tell apart, but it gives you low-level readable code.
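A toy disassembler shows both the idea and the imperfection. This sketch assumes a tiny table of single-byte x86 opcodes and walks the bytes one at a time. The function name and output format are invented for the example, and real disassemblers handle variable-length instructions.

```python
# Opcode -> mnemonic table for a few single-byte x86 instructions.
MNEMONICS = {0x90: "nop", 0xC3: "ret", 0xCC: "int3", 0xF4: "hlt"}

def disassemble(code: bytes, base: int = 0):
    """Decode byte by byte; unknown bytes are shown as raw data."""
    lines = []
    for offset, byte in enumerate(code):
        text = MNEMONICS.get(byte, f".byte 0x{byte:02x}")
        lines.append(f"{base + offset:08x}:  {byte:02x}  {text}")
    return lines

# Two real instructions followed by three bytes of data ("Hi" and 0x90).
code = b"\x90\xc3" + b"Hi\x90"
for line in disassemble(code, base=0x401000):
    print(line)
```

The last byte is data, but because its value is 0x90 the sketch prints it as nop. A simple linear decoder has no way to know where the code ends and the data begins.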

Decompiler

Attempts to convert machine code to high-level source code (like C). Much harder and less accurate. It can never recover comments, and it can recover original variable names only if debug info was left in the binary.

Debuggers / Hex Editors / Analyzers

These don’t transform code, but inspect and analyze binaries.

Stripping a binary

When you compile a program, the compiler and linker include symbols in the final binary file. These symbols are:

Function names (like main, printf)
Variable names
Debug info (file names, line numbers)

Some symbols are needed when the linker joins multiple .o files into a single executable.

Some symbols help debugging, like seeing function names in gdb.

But after the binary is done, these symbols:

Take up space
Might reveal internal details to attackers or reverse engineers!

So the linker can strip them during the build, or you can run a tool called strip later to remove them.

If a binary is stripped, you won't see names for its own functions, so you'll see placeholders like FUN_00401000 in Ghidra. You'll have to analyze what each function does to name it yourself.

Result:

Smaller file
Harder to reverse engineer (harder, not impossible)
Runs the same: these symbols aren't needed at runtime. (The dynamic symbols used to find shared-library functions like printf are kept.)
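The FUN_00401000 style of placeholder is easy to model: when an address has no entry in the symbol table, a tool can only name the function after its address. The sketch below assumes made-up addresses and a dictionary in place of a real symbol table. It imitates the naming pattern, not Ghidra's actual analysis.

```python
def label_functions(entry_points, symbols):
    """Name each function from the symbol table, or fall back to FUN_<address>."""
    return {addr: symbols.get(addr, f"FUN_{addr:08x}") for addr in entry_points}

entry_points = [0x401000, 0x401136]                    # hypothetical addresses
unstripped = {0x401000: "check_password", 0x401136: "main"}
stripped = {}                                          # symbol table removed

print(label_functions(entry_points, unstripped))
print(label_functions(entry_points, stripped))
```

The code at each address is identical in both cases. Only the labels are gone, which is why stripping makes reversing slower but not impossible.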

The file command

When you have a random file and want to know what it is, you run:

file myProgram

It looks inside the file, not just at the name or extension, and checks for known magic numbers (special byte sequences at the start). Examples:

MZ → Windows PE executable
Byte 0x7F followed by ELF → Unix/Linux binary
#!/bin/sh → Shell script
Bytes CA FE BA BE (raw hex values, not text) → Java .class file

With ELF binaries, it often tells you:

32-bit or 64-bit?
Statically or dynamically linked?
Stripped or unstripped?
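The 32-bit or 64-bit answer comes straight from the first few bytes of the ELF header: byte 4 holds the class and byte 5 the byte order. A minimal sketch of reading them follows. The function name and output wording are assumptions, and the linkage and stripped checks need the program and section headers, which are not covered here.

```python
def describe_elf(header: bytes) -> str:
    """Report class and byte order from the start of an ELF header."""
    if len(header) < 6 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    bits = {1: "32-bit", 2: "64-bit"}.get(header[4], "unknown class")
    order = {1: "LSB", 2: "MSB"}.get(header[5], "unknown byte order")
    return f"ELF {bits} {order}"

# A hand-made 16-byte identification block: 64-bit, little-endian.
sample = b"\x7fELF\x02\x01\x01" + bytes(9)
print(describe_elf(sample))
```

To try it on a real file, read the first 16 bytes in binary mode, for example open("/bin/ls", "rb").read(16) on a Linux system, and pass them in.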

file looks for patterns. If you fake those patterns, it can be fooled. If you edit a text file and write MZ as the first 2 bytes, it says: “Hey, looks like an MS-DOS executable!”
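A stripped-down version of that magic-number check fits in a few lines. This is only a sketch: the real file tool uses a much larger database and deeper tests, and the descriptions below are made-up strings, not its actual output.

```python
# (magic bytes, description) pairs, checked in order.
MAGIC = [
    (b"MZ", "MS-DOS / Windows PE executable"),
    (b"\x7fELF", "ELF binary"),
    (b"#!", "script with interpreter line"),
    (b"\xca\xfe\xba\xbe", "Java class file"),
]

def identify(data: bytes) -> str:
    """Guess a file type from its leading bytes only."""
    for magic, description in MAGIC:
        if data.startswith(magic):
            return description
    return "data"

print(identify(b"\x7fELF\x02\x01\x01"))
print(identify(b"#!/bin/sh\necho hi\n"))
print(identify(b"MZ this is really just a text file"))
```

The last call shows the weakness: the input is plain text, but because it starts with MZ it is reported as an executable.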

The Shell

You’ve got a binary. Now what? How does the machine actually run it?

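The shell is just a user-space program. It reads the command you type (like ./hello) from the keyboard via system calls and locates the binary (hello). It then creates a child process with fork(), and the child uses the execve() syscall to replace its own memory with the hello program. The shell itself keeps running and waits for the command to finish.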
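That launch sequence can be sketched in a few lines, assuming a POSIX system. The fork() step is the conventional way a shell stays alive across execve(): the child is replaced by the new program while the parent waits. The function name run_command and the 126/127 status values (borrowed from shell convention) are choices made for this example.

```python
import os
import shutil

def run_command(argv, env=None):
    """Run argv the way a shell would: locate, fork, execve, wait."""
    env = dict(os.environ) if env is None else env
    path = shutil.which(argv[0])       # PATH lookup; also accepts ./hello
    if path is None:
        return 127                     # conventional "command not found" status
    pid = os.fork()                    # duplicate the current process
    if pid == 0:
        # Child: replace this process image with the target program.
        try:
            os.execve(path, argv, env)
        except OSError:
            os._exit(126)              # exec failed (e.g. not executable)
    # Parent (the "shell"): wait for the child and report its exit status.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1

if __name__ == "__main__":
    print("exit status:", run_command(["echo", "hello from the child"]))
```

If execve() succeeds it never returns, because the child's code no longer exists. The only way the parent learns the outcome is through the exit status collected by waitpid().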