Coccinelle
References
Main References:
- INTRODUCTION TO COCCINELLE AND SMPL.
- The SmPL Grammar (version 1.0.6).
- Finding Error Handling Bugs in OpenSSL using Coccinelle.
- Hunting bugs with Coccinelle.
- GitHub Coccinelle Repo.
- Examples.
- More examples.
StackOverflow Threads:
- Pointer issues in Coccinelle.
- Limit Coccinelle matches to expression of given type.
- Match arbitrary-depth nested fields in struct in Coccinelle.
- Detect passing pointer to uninitialized variable.
- Coccinelle help to replace a function with variable args.
- What is the correct type to use for declaring a metavariable that possibly could match either variables or members in a struct?.
- C: Convert A ? B : C into if (A) B else C.
- coccinelle: replace single letter variables (i -> ii).
- Adding missing NULL checks after malloc with coccinelle.
An Intro
Briefly...
Coccinelle is a program matching and transformation engine which provides the language SmPL (Semantic Patch Language) for specifying desired matches and transformations in C code.
-- Coccinelle website.
Coccinelle allows you to "templatize" patches so that they can be applied over an entire codebase and match the same "pattern" in a way that abstracts out things like whitespace, variable names etc.
All Coccinelle scripts look something like this:
    @@
    // Metavariable declarations
    @@
    // Transformations
Metavariables allow you to abstract out things like types, expressions, statements etc. The transformations dictate how the code should be changed.
Run Spatch
Check your COCCI script:
    spatch --parse-cocci mysp.cocci
Run your COCCI script:
    spatch --sp-file mysp.cocci file.c
or
    spatch --sp-file mysp.cocci --dir directory
An Example...
An example I had recently was very similar to the ARRAY_SIZE() example. There were many places in the code where I was doing sizeof(ptr->field) / sizeof(byte_t), where ptr could be any variable name that refers to a struct pointer and field could be any member of that struct. It could appear anywhere in the code, say as size_t num_blocks = sizeof(new_block->info) / sizeof(byte_t), or as PrintBlockInfo(some_block, sizeof(some_block->data) / sizeof(byte_t)), for example. I wanted to be able to transform each of these to size_t num_blocks = NBLOCKS(new_block->info) and PrintBlockInfo(some_block, NBLOCKS(some_block->data)), respectively.
These transformations are not something that could be easily accomplished with a regular expression.
One immediate problem is that in sizeof(ptr->field) / sizeof(byte_t), ptr needs to be a pointer to the type block_descr_t. Making the change for other types might not be the correct thing to do, and certainly would read strangely for non-block types. Coccinelle to the rescue!
    @@
    typedef block_descr_t;
    typedef byte_t;
    block_descr_t *T;
    identifier F;
    @@
    - sizeof(T->F) / sizeof(byte_t)  //< Coccinelle is clever - it knows T must be of type ...
    + NBLOCKS(F)                     //< ... `block_descr_t` to match :)
Let's create the following test file:
    typedef unsigned char byte_t;
    typedef struct { byte_t header; byte_t crc[2]; } block_info_t;
    typedef struct { byte_t crc[2]; byte_t data[20]; } block_data_t;
    typedef struct { block_info_t info; block_data_t data; } block_descr_t;

    void func(block_descr_t *desc)
    {
       const size_t num_blocks = sizeof(desc->info) / sizeof(byte_t);
       struct { byte_t byte_array[5]; } not_a_block;
       const size_t just_bytes = sizeof(not_a_block->byte_array) / sizeof(byte_t);
    }

    void dump_block(block_descr_t *desc)
    {
       for (size_t idx = 0; idx < sizeof(desc->info) / sizeof(byte_t); ++idx)
       {
          printf("%u", desc->info[idx]);
       }
    }

    void dump_block_stats(block_descr_t *desc)
    {
       printf("At line %u, num data|info blocks: %u|%u\n",
          __LINE__,
          sizeof(desc->data) / sizeof(byte_t),
          sizeof(desc->info) / sizeof(byte_t));
    }
If we run the Coccinelle script on the above C file (minus the annotation comments) we get the following diff produced:
    $ spatch -sp_file junk.cocci junk.c
    init_defs_builtins: /usr/lib/coccinelle/standard.h
    HANDLING: junk.c
    diff =
    --- junk.c
    +++ /tmp/cocci-output-11849-4d0df1-junk.c
    @@ -5,12 +5,12 @@ typedef struct { block_info_t info; bloc
     void func(block_descr_t *desc)
     {
    -   const size_t num_blocks = sizeof(desc->info) / sizeof(byte_t);
    +   const size_t num_blocks = NBLOCKS(info);
        struct { byte_t byte_array[5]; } not_a_block;
        const size_t just_bytes = sizeof(not_a_block->byte_array) / sizeof(byte_t);
     }

     void dump_block(block_descr_t *desc)
     {
    -   for (size_t idx = 0; idx < sizeof(desc->info) / sizeof(byte_t); ++idx)
    +   for (size_t idx = 0; idx < NBLOCKS(info); ++idx)
        {
           printf("%u", desc->info[idx]);
        }
    @@ -20,6 +20,6 @@ void func2(block_descr_t *desc)
     {
        printf("At line %u, num data|info blocks: %u|%u\n",
           __LINE__,
    -      sizeof(desc->data) / sizeof(byte_t),
    -      sizeof(desc->info) / sizeof(byte_t));
    +      NBLOCKS(data),
    +      NBLOCKS(info));
     }
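So, we can see that the replacement has been made intelligently in multiple different contexts and that it has only been made for the desired type: note how not_a_block has correctly not been transformed. Sweeeet!
One thing to watch: the + line of the patch emits only the field name F, which is why the output reads NBLOCKS(info) rather than NBLOCKS(desc->info). Writing + NBLOCKS(T->F) instead would keep the whole pointer expression.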
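The notes never show what NBLOCKS actually expands to, so here is a minimal sketch of one plausible definition, purely as an assumption on my part. The typedefs are copied from the test file; the macro body and the two helper functions (info_blocks, data_blocks) are hypothetical and only there to show the macro being applied to a member of a block_descr_t pointer.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char byte_t;
typedef struct { byte_t header; byte_t crc[2]; } block_info_t;
typedef struct { byte_t crc[2]; byte_t data[20]; } block_data_t;
typedef struct { block_info_t info; block_data_t data; } block_descr_t;

/* Assumed definition: the original never shows NBLOCKS, this is a guess
 * that simply wraps the expression the semantic patch replaces. */
#define NBLOCKS(x) (sizeof(x) / sizeof(byte_t))

/* Hypothetical helper: number of byte-sized blocks in the info member. */
size_t info_blocks(const block_descr_t *desc)
{
    assert(desc != NULL);
    return NBLOCKS(desc->info);
}

/* Hypothetical helper: number of byte-sized blocks in the data member. */
size_t data_blocks(const block_descr_t *desc)
{
    assert(desc != NULL);
    return NBLOCKS(desc->data);
}
```

With a definition like this the macro needs the full member expression (desc->info), not just the bare field name, because sizeof has to see something that names an object or type.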
The Whole Transformation Is A Match
The whole of the transformation section has to match for the transformation to be applied. This is what I meant by a "templatized" patch.
Let's have a look at an example of removing the cast from calls to malloc():
    @@
    type I;
    expression E;
    identifier p;
    @@
    -I *p = (I *)malloc(E);
    +I *p = malloc(E);
Applying this to the following file will remove the cast:
    #include <stdlib.h>

    int main(void)
    {
       int *some_int_prt = (int *)malloc(10 * sizeof(int));
       // ^^^ The patch will replace the above with
       // int *some_int_prt = malloc(10 * sizeof(int));
       return 0;
    }
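As a side note on why dropping the cast is a safe thing to automate: in C, the void * that malloc() returns converts implicitly to any object pointer type, so the cast adds nothing. A small sketch of the post-patch style follows; the function name alloc_ints is made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper written in the style the patch produces: no cast,
 * because void * converts implicitly to int * in C. */
int *alloc_ints(size_t count)
{
    assert(count > 0);
    int *ints = malloc(count * sizeof *ints);
    return ints; /* NULL if the allocation failed */
}
```

This only holds for C. C++ has no implicit conversion from void *, so the same patch applied to C++ sources would break the build.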
Let's modify the patch as follows - it's contrived but hey:
    @@
    type I;
    expression E, E2;
    identifier p, p2;
    @@
    I p2 = E2; //< We've artificially added this line for the example's sake.
    -I *p = (I *)malloc(E);
    +I *p = malloc(E);
Apply this to the above C file and nothing happens - no patch is created. Why? The reason is that the whole patch must be able to match some section of the program it is applied over. In the C file above, although the malloc() line will match the second line of the pattern, there is nothing matching the first line, I p2 = E2.
The above patch would only modify a file that looked like this:
    int main(void)
    {
       int variable_1 = 2; //< This line is necessary so that the second patch file can match!
       int *some_int_prt = (int *)malloc(10 * sizeof(int));
       // ^^^ The patch will replace the above with
       // int *some_int_prt = malloc(10 * sizeof(int));
       return 0;
    }
The patch would not, however, match this file:
    int main(void)
    {
       int variable_1 = 2; //< This line is necessary so that the second patch file can match!
       call_some_function();
       int *some_int_prt = (int *)malloc(10 * sizeof(int));
       // ^^^ The patch will replace the above with
       // int *some_int_prt = malloc(10 * sizeof(int));
       return 0;
    }
It does not find a match for precisely the same reason: call_some_function() does not appear in the semantic patch between the first integer variable declaration and the pointer declaration and assignment. To make it match we would have to use ellipses (...) to match arbitrary code between the first integer declaration and the pointer declaration:
    @@
    type I;
    expression E, E2;
    identifier p, p2;
    @@
    I p2 = E2; //< We've artificially added this line for the example's sake.
    ...        //< Match arbitrary program flow
    -I *p = (I *)malloc(E);
    +I *p = malloc(E);
Types Of Metavariables
Keyword | Meaning |
identifier | The name of something: a function, macro, variable or struct member. It does not match literals such as 42 or "a string"; those are constants. |
parameter [list] | Matches function parameters. |
type | Matches a C type. The example below this table uses a type metavariable to remove casts of malloc(). |
statement | Matches any C statement. Note that an expression followed by a semicolon is a statement. |
expression | Matches any C expression. An expression metavariable can be further constrained by its type. |
constant | Matches a literal constant such as 42 or "a string". |
position | |
declaration | A declaration metavariable matches the declaration of one or more variables, all sharing the same type specification. |
Example for type:
    @@
    type I;
    identifier D;
    @@
    - I *D = (I *)malloc(...);
    + I *D = malloc(...);
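To make the keywords a little more concrete, here is a small C function with comments marking which fragments each kind of metavariable could bind to. The function scaled_length is made up purely for labelling purposes, and the comments are my reading of the descriptions above rather than output from spatch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical function, invented only so the pieces can be labelled.
 * parameter list: const char *text, size_t factor
 * identifier:     scaled_length, text, factor, count */
size_t scaled_length(const char *text, size_t factor)
{
    assert(text != NULL);         /* statement: an expression followed by ';' */
    size_t count = 0;             /* declaration; type: size_t; constant: 0 */
    while (text[count] != '\0')   /* expression: text[count] != '\0' */
        count++;                  /* statement: count++; */
    return count * factor;        /* expression: count * factor */
}
```

For example, scaled_length("abc", 2) walks three characters and returns 6.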
Rules
We've already been using anonymous rules in the form of:
    @@
    // Meta variables
    @@
    // Transformations
Named rules are declared like so:
    @rulename@
    // Meta variables
    @@
    // Transformations
A named rule, like the anonymous rules we've been defining previously, will either match or not match a portion of the target file. All rules, anonymous or named, evaluate to true if they match something in the target file, and false otherwise.
This is how we can make one rule depend on another. If rule B depends on rule A, rule B is only applied if rule A evaluates to true, i.e., matched something in the target.
The classic example is replacing sizeof(a)/sizeof(a[0]) with the macro ARRAY_SIZE(a), which requires the header file kernel.h. If the target does not include this header then using the macro will cause a compile error so we don't want to make that transformation! So, to only make the transformation when the header is included we can use two rules. The first will make sure the header is included and the second will do the actual transformation.
The first rule:
    @includes_kernel_h@
    @@
    #include <linux/kernel.h>
If the file includes the kernel.h header file, the rule includes_kernel_h will match that line and evaluate to true.
So... the second rule (see the real deal for a full example):
    @depends on includes_kernel_h@
    type T;
    T[] E;
    @@
    - (sizeof(E)/sizeof(*E))
    + ARRAY_SIZE(E)
The transformation is specified to depend on the rule includes_kernel_h. This means that the rule will run when, and only when, includes_kernel_h evaluates to true, which will only be the case when that rule finds a match in the target file.
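For anyone reading this outside a kernel tree, here is a rough user-space sketch of what the rule swaps in. The ARRAY_SIZE definition below is a simplified stand-in that I am assuming for illustration (the real kernel macro carries an extra compile-time check that its argument is an array), and count_primes_table is a made-up function.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in, NOT the kernel's definition: the kernel version
 * additionally rejects pointers at compile time. */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical function comparing the idiom before and after the rewrite. */
size_t count_primes_table(void)
{
    static const int primes[] = { 2, 3, 5, 7, 11 };
    size_t before = (sizeof(primes) / sizeof(*primes)); /* what the - line matches */
    size_t after = ARRAY_SIZE(primes);                  /* what the + line emits   */
    assert(before == after);
    return after;
}
```

The T[] E declaration in the rule matters for the same reason the kernel adds its check: the idiom only gives the element count when E really is an array, not a pointer.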
You can also invert the dependency, so that a rule is applied only when another rule did not match: @rule_name depends on !dep@.
Dots
Basic Dots
Basic dots are like a wild card that matches, in regular expression terms, .*. Consider the following:
    int main(void)
    {
       Special_t a = {
          // Arbitrary initialisation;
          .m1 = 10,
          .m2 = 20,
       };
       printf("Any amount of arbitrary code can go here!\n");
       a.m3 = rand();
       if (a.m2 == a.m3)
       {
          printf("You got lucky :)\n");
       }
       some_func(&a, &a);
       return 0;
    }
with this semantic patch:
    @@
    type T;
    identifier var, var2;
    @@
    T var = ...;
    ...
    - some_func(&var, &var2);
    + another_newer_func(&var, &var2, false);
In the above snippet you can see how each set of ellipses, ..., matches an arbitrary-length part of the program. It's like a .* in a regular expression. The output is this:
    --- test.c
    +++ /tmp/cocci-output-22579-7874ae-test.c
    @@ -9,6 +9,6 @@ int main(void)
          printf("You got lucky :)\n");
       }
    -  some_func(&a, &a);
    +  another_newer_func(&a, &a, false);
       return 0;
     }
It does what we wanted, which was to replace the call to some_func() with another_newer_func(), but only for invocations whose first argument is the address of a variable that was declared, with an initialiser, earlier in the same function.
Restrict What "..." matches with "does not match"
Positions