Coccinelle

An Intro

Briefly...

Coccinelle is a program matching and transformation engine which provides the language SmPL (Semantic Patch Language) for specifying desired matches and transformations in C code. -- Coccinelle website.

Coccinelle allows you to "templatize" patches so that they can be applied over an entire codebase and match the same "pattern" in a way that abstracts out things like whitespace, variable names etc.

All Coccinelle scripts look something like this:

@@
// Metavariable declarations
@@

// Transformations

Metavariables allow you to abstract out things like types, expressions, statements etc. The transformations dictate how the code should be changed.

Run Spatch

Check your COCCI script: spatch --parse-cocci mysp.cocci

Run your COCCI script: spatch --sp-file mysp.cocci file.c, or spatch --sp-file mysp.cocci --dir directory to process a whole directory.

An Example...

An example I had recently was very similar to the ARRAY_SIZE() example. There were many places in the code where I was doing sizeof(ptr->field) / sizeof(byte_t), where ptr could be any variable name that refers to a struct pointer and field could be any member of that struct. It could appear anywhere in the code, for example as size_t num_blocks = sizeof(new_block->info) / sizeof(byte_t), or as PrintBlockInfo(some_block, sizeof(some_block->data) / sizeof(byte_t)). I wanted to be able to transform these to size_t num_blocks = NBLOCKS(new_block->info) and PrintBlockInfo(some_block, NBLOCKS(some_block->data)), respectively.

These transformations are not something that could be easily accomplished with a regular expression. One immediate problem is that in sizeof(ptr->field) / sizeof(byte_t), ptr needs to be a pointer to block_descr_t. Making the change for other types might not be the correct thing to do, and certainly would read strangely for non-block types. Coccinelle to the rescue!

@@
typedef block_descr_t;
typedef byte_t;
block_descr_t *T;
identifier F;
@@

- sizeof(T->F) / sizeof(byte_t) //< Coccinelle is clever - it knows T must be of type ...
+ NBLOCKS(F)                    //< ... `block_descr_t` to match :)

Let's create the following test file:

typedef unsigned char byte_t;
typedef struct { byte_t header; byte_t crc[2]; } block_info_t;
typedef struct { byte_t crc[2]; byte_t data[20]; } block_data_t;
typedef struct { block_info_t info; block_data_t data; } block_descr_t;

void func(block_descr_t *desc)
{
	const size_t num_blocks = sizeof(desc->info) / sizeof(byte_t);

	struct { byte_t byte_array[5]; } not_a_block;
	const size_t just_bytes = sizeof(not_a_block->byte_array) / sizeof(byte_t);
}

void dump_block(block_descr_t *desc)
{
	for (size_t idx = 0; idx < sizeof(desc->info) / sizeof(byte_t); ++idx)
	{
		printf("%u", desc->info[idx]);
	}
}

void dump_block_stats(block_descr_t *desc)
{
	printf("At line %u, num data|info blocks: %u|%u\n", 
		__LINE__,
		sizeof(desc->data) / sizeof(byte_t),
		sizeof(desc->info) / sizeof(byte_t));
}

If we run the Coccinelle script on the above C file (minus the annotation comments) we get the following diff produced:

$ spatch -sp_file junk.cocci junk.c
init_defs_builtins: /usr/lib/coccinelle/standard.h
HANDLING: junk.c
diff =
--- junk.c
+++ /tmp/cocci-output-11849-4d0df1-junk.c
@@ -5,12 +5,12 @@ typedef struct { block_info_t info; bloc

 void func(block_descr_t *desc)
 {
-    const size_t num_blocks = sizeof(desc->info) / sizeof(byte_t);
+    const size_t num_blocks = NBLOCKS(info);

     struct { byte_t byte_array[5]; } not_a_block;
     const size_t just_bytes = sizeof(not_a_block->byte_array) / sizeof(byte_t);
 }

 void dump_block(block_descr_t *desc)
 {
-    for (size_t idx = 0; idx < sizeof(desc->info) / sizeof(byte_t); ++idx)
+    for (size_t idx = 0; idx < NBLOCKS(info); ++idx)
     {
         printf("%u", desc->info[idx]);
     }
@@ -20,6 +20,6 @@ void func2(block_descr_t *desc)
 {
     printf("At line %u, num data|info blocks: %u|%u\n",
         __LINE__,
-        sizeof(desc->data) / sizeof(byte_t),
-        sizeof(desc->info) / sizeof(byte_t));
+        NBLOCKS(data),
+        NBLOCKS(info));
 }

So, we can see that the replacement has been made intelligently in several different contexts, and only for the desired type - note how not_a_block has correctly not been transformed. Sweeeet!

The Whole Transformation Is A Match

The whole of the transformation section has to match for the transformation to be applied. This is what I meant by a "templatized" patch.

Let's have a look at an example of removing the cast from calls to malloc():

@@
type I;
expression E;
identifier p;
@@

-I *p = (I *)malloc(E);
+I *p = malloc(E);

Applying this to the following file will remove the cast:

int main(void)
{
    int *some_int_prt = (int *)malloc(10 * sizeof(int));
    // ^^^ The patch will replace the above with
    // int *some_int_prt = malloc(10 * sizeof(int));

    return 0;
}

Let's modify the patch as follows - it's contrived but hey:

@@
type I;
expression E, E2;
identifier p, p2;
@@

I p2 = E2; // < We've artificially added this line for the example's sake.

-I *p = (I *)malloc(E);
+I *p = malloc(E);

Apply this to the above C file and nothing happens - no patch is created. Why? The reason is that the whole patch must be able to match some section of the program it is applied over. In the C file above, although the malloc() line matches the second statement of the pattern, nothing before it matches the first statement, I p2 = E2.

The above patch would only modify a file that looked like this:

int main(void)
{
    int variable_1 = 2; // < This line is necessary so that the second patch file can match!

    int *some_int_prt = (int *)malloc(10 * sizeof(int));
    // ^^^ The patch will replace the above with
    // int *some_int_prt = malloc(10 * sizeof(int));

    return 0;
}

The patch would not, however, match this file:

int main(void)
{
    int variable_1 = 2; //< This line is necessary so that the second patch file can match!

    call_some_function();

    int *some_int_prt = (int *)malloc(10 * sizeof(int));
    // ^^^ The patch will replace the above with
    // int *some_int_prt = malloc(10 * sizeof(int));

    return 0;
}

It does not find a match for precisely the same reason: call_some_function() sits between the integer declaration and the pointer declaration in the C file, but nothing in the semantic patch matches it. To make it match we would have to use ellipses (...) to match arbitrary code between the first integer declaration and the pointer declaration:

@@
type I;
expression E, E2;
identifier p, p2;
@@

I p2 = E2; //< We've artificially added this line for the example's sake.
... //< Match arbitrary program flow
-I *p = (I *)malloc(E);
+I *p = malloc(E);

Types Of Metavariables

Keyword	Meaning
identifier	An identifier is the name of something: a structure field, a macro, a function, or a variable. It is a name rather than an expression that has a value, so it does not match literals such as 42 or "a string" (those are matched by constant). An identifier can, however, be used in the position of an expression, where it represents a variable. [Ref].
parameter [list]	Matches function parameters.
type	Matches a particular type. E.g., remove casts of `malloc()`: `@@ type I; identifier D; expression E; @@ - I *D = (I *)malloc(E); + I *D = malloc(E);`
statement	Matches any C statement, for example `if (condition) do_something(); else do_something_else();` or `{ ... }`, for example. A statement is just a standalone unit of execution and doesn’t return anything ... the sole purpose of a statement is to have side-effects [Ref]. Note that an expression followed by a semicolon is a statement and any sequence of statements surrounded by curly braces is a statement - called a Compound Statement [Ref].
expression	Matches any C expression. An expression is a combination of values and functions that are combined and interpreted by the compiler to create a new value ... the purpose of an expression is to create a value (with some possible side-effects) [Ref]. An expression metavariable can be further constrained by its type.
constant	Matches a constant, for example 42 or "a string".
position	A position metavariable is used by attaching it using `@` to any token, including another metavariable. Its value is the position (file, line number, etc.) of the code matched by the token. It is also possible to attach expression, declaration, type, initialiser, and statement metavariables in this manner. In that case, the metavariable is bound to the closest enclosing expression, declaration, etc. If such a metavariable is itself followed by a position metavariable, the position metavariable applies to the metavariable that it follows, and not to the attached token. This makes it possible to get eg the starting and ending position of `f(...)`, by writing `f(...)@E@p`, for expression metavariable `E` and position metavariable `p` [Ref].
declaration	A declaration metavariable matches the declaration of one or more variables, all sharing the same type specification.

Rules

We've already been using anonymous rules in the form of:

@@
// Meta variables
@@

// Transformations

Named rules are declared like so:

@rulename@
// Meta variables
@@

// Transformations

A rule you define, whether named or anonymous like the ones we've been using so far, will either match or not match a portion of the target file.

All rules, anonymous or named, evaluate to true if they match something in the target file, and false, otherwise.

This is how we can make one rule depend on another. If rule B depends on rule A, rule B is only applied if rule A evaluates to true, i.e., matched something in the target.

The classic example is replacing sizeof(a)/sizeof(a[0]) with the macro ARRAY_SIZE(a), which requires the header file kernel.h. If the target does not include this header then using the macro will cause a compile error so we don't want to make that transformation! So, to only make the transformation when the header is included we can use two rules. The first will make sure the header is included and the second will do the actual transformation.

The first rule:

@includes_kernel_h@
@@

#include <linux/kernel.h>

If the file includes the kernel.h header file, the rule includes_kernel_h will match that line and evaluate to true.

So... the second rule (see the real deal for a full example):

@depends on includes_kernel_h@
type T;
T[] E;
@@

- (sizeof(E)/sizeof(*E))
+ ARRAY_SIZE(E)

The transformation is specified to depend on the rule includes_kernel_h. This means that the rule will run when, and only when, includes_kernel_h evaluates to true, which will only be the case when that rule finds a match in the target file.

You can also negate the dependency, so that a rule is applied only when another rule did not match: @rule_name depends on !dep@.

Dots

Basic Dots

Basic dots are like a wild card that matches, in regular expression terms, .*. Consider the following:

int main(void)
{
   Special_t a =  {
      // Arbitrary initialisation;
      .m1 = 10,
      .m2 = 20,
   };

   printf("Any amount of arbitrary code can go here!\n") ;
   a.m3 = rand();
   if (a.m2 == a.m3) {
      printf("You got lucky :)\n");
   }

   some_func(&a, &a);

   return 0;
}

@@
type T;
identifier var, var2;
identifier some_func;
@@

T var = ...;
...
- some_func(&var, &var2);
+ another_newer_func(&var, &var2, false);

In the above snippet you can see how each set of ellipses, ..., matches an arbitrary-length part of the program. It's like a .* in a regular expression. The output is this:

--- test.c
+++ /tmp/cocci-output-22579-7874ae-test.c
@@ -9,6 +9,6 @@ int main(void)
       printf("You got lucky :)\n");
    }
 
-   some_func(&a, &a);
+   another_newer_func(&a, &a, false);
    return 0;
 }

It does what we wanted, which was to replace the call to some_func() with another_newer_func(), but only where the call follows a declaration of the variable that is passed to it.

Links...