PyCLang Notes
References
- Python Clang Bindings, GitHub.
- Parsing C++ in Python with Clang, Eli Bendersky's Website, July 03, 2011.
- Clang CXCursor Doxygen.
- Clang Cursor Manipulators Doxygen.
- Python clang does not search system include paths, StackOverflow.
Installing, Loading LibClang, Versions That Play Nice
A Basic Install
To install on Ubuntu try:
sudo apt-get update -y
sudo apt-get install -y libclang-dev
But beware, sometimes versions don't play nice together. For example, Python bindings at version 6.0.0.2 seem to require at least libclang1-8.so.1.
If you see anything like the following, you may have a bindings vs. library version dependency issue:
>>> import clang.cindex
>>> index = clang.cindex.Index.create()
Traceback (most recent call last):
<snip>
AttributeError: /usr/lib/llvm-3.8/lib/libclang.so: undefined symbol: clang_CXXConstructor_isConvertingConstructor
                         ^^^^^^^^
                         Out-of-date libclang library!!!
To install a specific version use, for example:
sudo apt install libclang1-8
Getting Python To Find LibClang
On some platforms Python doesn't seem to automatically find the LibClang library. You'll know it hasn't found the library when you see something like this:
>>> import clang.cindex
>>> index = clang.cindex.Index.create()
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/clang/cindex.py", line 4129, in get_cindex_library
<snip>
OSError: libclang.so: cannot open shared object file: No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
<snip>
clang.cindex.LibclangError: libclang.so: cannot open shared object file: No such file or directory. To provide a path to libclang use Config.set_library_path() or Config.set_library_file().
To get it to find the library, set your LD_LIBRARY_PATH environment variable to include the path to that library's directory [Ref]. You may also need to set DYLD_LIBRARY_PATH. For example:
export DYLD_LIBRARY_PATH=/usr/lib/llvm-8/lib/
export LD_LIBRARY_PATH=/usr/lib/llvm-8/lib/
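The LibclangError message shown earlier also suggests Config.set_library_path() and Config.set_library_file() as a way of pointing the bindings at the library from inside Python. Here is a minimal sketch of that route. The helper names find_libclang and configure_libclang are made up for these notes, the file name patterns are guesses at common library names, and the default directory is just the example path used above.

```python
import glob
import os


def find_libclang(search_dirs):
    """Return the first libclang shared library found in search_dirs, or None.

    The patterns are assumptions about how the library is usually named.
    """
    patterns = ("libclang.so", "libclang.so.*", "libclang-*.so*", "libclang.dylib")
    for directory in search_dirs:
        for pattern in patterns:
            matches = sorted(glob.glob(os.path.join(directory, pattern)))
            if matches:
                return matches[0]
    return None


def configure_libclang(search_dirs=("/usr/lib/llvm-8/lib",)):
    """Point the Python bindings at a libclang found in search_dirs.

    Returns the path used, or None if no library (or no bindings) was found.
    """
    path = find_libclang(search_dirs)
    if path is None:
        return None
    try:
        import clang.cindex
    except ImportError:
        return None
    # set_library_file() must be called before the library is first loaded.
    if not clang.cindex.Config.loaded:
        clang.cindex.Config.set_library_file(path)
    return path
```

Call configure_libclang() before the first clang.cindex.Index.create(); once the library has been loaded the setting can no longer be changed.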
If you want to look at where Python is loading the library from use:
LD_DEBUG=libs python3
Do your import and index creation as normal and look at the trace output to see how it is finding your libclang library.
Debian Stretch
Getting the right version of libclang for the Python3 bindings that are auto installed using pip was a little challenging. So far I have this [Ref1][Ref2] (seems to work!):
sudo apt install software-properties-common
sudo apt update
sudo apt install lsb-release
# For latest version:
#    bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"
# But I want 8, so
wget https://apt.llvm.org/llvm.sh
chmod +x llvm.sh
sudo ./llvm.sh 8
sudo ln -s /usr/lib/llvm-8/lib/libclang.so.1 /usr/lib/llvm-8/lib/libclang.so
export LD_LIBRARY_PATH=/usr/lib/llvm-8/lib
export DYLD_LIBRARY_PATH=/usr/lib/llvm-8/lib
But note that this installs the entire clang toolchain, which, if you only want the libclang shared library, takes up a whole load more disk space than needed - gigs worth! The installed tree can be pruned, however, to get rid of anything you don't need. There is probably an easier way! Sigh...
Some Examples / Playing
Here are some examples of playing around with pyclang...
I have version 6.0.0.2, installed using pip install clang (clang installed separately).
Poo, at the moment I can't see a way of getting the opcode of a binary operator using these bindings. There appears to be an accepted patch for this functionality, but it's been hanging around for over 4 years at the time of writing... so err... not holding my breath.
PyBee seems to have added this functionality in their fork called Sealang, which they say is an improved set of Python bindings for libclang, but unfortunately this project is no longer maintained. I tried testing it. Although it installed, the module could not be imported due to a missing symbol - I'm guessing it's too out of date to work with the later libclang versions :'(
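One workaround that only needs the stock bindings is to read the operator out of the token stream: the tokens of the left-hand operand come first, so the next token should be the operator. This is only a sketch under that assumption (it will likely go wrong when macros are involved), and binary_operator_spelling is a name made up for these notes. It relies only on get_children() and get_tokens().

```python
def binary_operator_spelling(cursor):
    """Guess the operator of a BINARY_OPERATOR cursor from its tokens.

    Assumes the left-hand operand's tokens are a prefix of the expression's
    tokens, so the operator is the first token after them.
    Returns None if the cursor does not look like a binary expression.
    """
    children = list(cursor.get_children())
    if len(children) != 2:
        return None
    lhs_token_count = len(list(children[0].get_tokens()))
    tokens = [tok.spelling for tok in cursor.get_tokens()]
    if lhs_token_count >= len(tokens):
        return None
    return tokens[lhs_token_count]
```

For an expression such as a * b * c the outer operator node has the tokens a, *, b, *, c and its left child has a, *, b, so the function picks the token at index 3.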
The Translation Unit - clang.cindex.TranslationUnit
The TranslationUnit seems to have the following useful properties:
- codeComplete(path, line, column)
  Gives some auto-complete suggestions:
  >>> import clang.cindex
  >>> index = clang.cindex.Index.create()
  >>> tu = index.parse('test_files/test1.c')
  >>> a = tu.codeComplete('test_files/test1.c', 18, 33)
  >>> for r in a.results: print(r)
  ...
  {'const int', ResultType} | {'param1', TypedText} || Priority: 17 || Availability: Available || Brief comment: None
  {'const int', ResultType} | {'test_static_local', TypedText} || Priority: 17 || Availability: Available || Brief comment: None
  {'const int', ResultType} | {'test_static_global', TypedText} || Priority: 25 || Availability: Available || Brief comment: None
  {'enum MyTestEnum', ResultType} | {'TEST_ENUM_1', TypedText} || Priority: 16 || Availability: Available || Brief comment: None
  {'enum MyTestEnum', ResultType} | {'TEST_ENUM_2', TypedText} || Priority: 16 || Availability: Available || Brief comment: None
- cursor
  A cursor is an abstraction that represents any element in an AST.
  The cursor abstraction unifies the different kinds of entities in a program - declaration, statements, expressions, references to declarations, etc. - under a single "cursor" abstraction with a common set of operations. Common operation for a cursor include: getting the physical location in a source file where the cursor points, getting the name associated with a cursor, and retrieving cursors for any child nodes of a particular cursor.
  [ref].
diagnostics
from_ast_file
from_param
from_source
get_extent(filename, locations)
get_file()
get_includes()
get_location()
get_tokens()
index
obj
reparse()
save()
- spelling
  Returns the filename the TU addresses.
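As a small illustration of one of these, the diagnostics property can be used to check whether a file parsed cleanly. The sketch below assumes each diagnostic has severity, spelling and location attributes (with location.file possibly None). format_diagnostics and has_errors are names made up for these notes, and the value 3 is assumed to match clang.cindex.Diagnostic.Error.

```python
ERROR_SEVERITY = 3  # assumed to correspond to clang.cindex.Diagnostic.Error


def format_diagnostics(tu):
    """Return one 'file:line:col: severity N: message' string per diagnostic."""
    lines = []
    for diag in tu.diagnostics:
        loc = diag.location
        filename = loc.file.name if loc.file is not None else "<unknown>"
        lines.append("%s:%d:%d: severity %d: %s"
                     % (filename, loc.line, loc.column, diag.severity, diag.spelling))
    return lines


def has_errors(tu):
    """True if any diagnostic is an error or worse."""
    return any(diag.severity >= ERROR_SEVERITY for diag in tu.diagnostics)
```

With a real translation unit this would be used as format_diagnostics(index.parse('test_files/test1.c')).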
The Cursor Abstraction - clang.cindex.Cursor
The cursor abstraction unifies the different kinds of entities in a program - declaration,
statements, expressions, references to declarations, etc. - under a single "cursor"
abstraction with a common set of operations. Common operation for a cursor include: getting
the physical location in a source file where the cursor points, getting the name associated
with a cursor, and retrieving cursors for any child nodes of a particular cursor.
[ref].
The cursor functions and properties that are useful for navigating the AST are get_children(), lexical_parent, semantic_parent and walk_preorder(). From the clang docs:
The lexical parent of a cursor is the cursor in which the given cursor was actually written. For many declarations, the lexical and semantic parents are equivalent (the semantic parent is returned by clang_getCursorSemanticParent()). They diverge when declarations or definitions are provided out-of-line. For example:
class C {
  void f();
};
void C::f() { }
In the out-of-line definition of C::f, the semantic parent is the class C, of which this function is a member. The lexical parent is the place where the declaration actually occurs in the source code; in this case, the definition occurs in the translation unit. In general, the lexical parent for a given entity can change without affecting the semantics of the program, and the lexical parent of different declarations of the same entity may be different. Changing the semantic parent of a declaration, on the other hand, can have a major impact on semantics, and redeclarations of a particular entity should all have the same semantic context.
In the example above, both declarations of C::f have C as their semantic context, while the lexical context of the first C::f is C and the lexical context of the second C::f is the translation unit.
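To make the difference concrete, here is a rough sketch that collects the cursors whose lexical and semantic parents differ, i.e. the out-of-line ones. The names walk and out_of_line_cursors are made up for these notes. The hand-rolled walk is only there so the function works on any object with get_children(); with real cursors walk_preorder() does the same job. I have not checked whether real translation units report other cursor kinds here too, so the result may need filtering by kind.

```python
def walk(cursor):
    """Yield cursor and all of its descendants in pre-order."""
    yield cursor
    for child in cursor.get_children():
        yield from walk(child)


def out_of_line_cursors(root):
    """Return the cursors whose lexical parent differs from their semantic parent."""
    result = []
    for cur in walk(root):
        lexical = cur.lexical_parent
        semantic = cur.semantic_parent
        # Parents may be None (e.g. for the translation unit cursor itself).
        if lexical is not None and semantic is not None and lexical != semantic:
            result.append(cur)
    return result
```

For the class C example from the clang docs this should pick out the second C::f, whose lexical parent is the translation unit and whose semantic parent is C.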
The cursor abstraction has the following properties/functions of interest, some of which wrap up the C cursor manipulator functions [ref]:
access_specifier
availability
brief_comment
canonical
data
displayname
enum_type
enum_value
exception_specification_kind
extent
from_cursor_result
from_location
from_result
get_arguments
get_bitfield_width
- get_children()
  Returns a list iterator object of the immediate cursor children of this node.
  >>> import clang.cindex
  >>> index = clang.cindex.Index.create()
  >>> tu = index.parse('test_files/test1.c')
  >>> print(tu.cursor.get_children())
  <list_iterator object at 0x0000020472AF4828>
  >>> for a in tu.cursor.get_children():
  ...     print(a)
  ...
  <clang.cindex.Cursor object at 0x0000020472A72D48>
  <clang.cindex.Cursor object at 0x0000020472A72DC8>
  <clang.cindex.Cursor object at 0x0000020472A72E48>
  <clang.cindex.Cursor object at 0x0000020472A72A48>
- get_definition()
  Use this to go from a function declaration node to its corresponding definition node.
get_field_offsetof
get_num_template_arguments
get_template_argument_kind
get_template_argument_type
get_template_argument_unsigned_value
get_template_argument_value
get_tokens
get_usr
hash
is_abstract_record
- is_anonymous()
  Use this to figure out if the cursor node is an anonymous enum or struct, for example.
is_bitfield
is_const_method
is_converting_constructor
is_copy_constructor
is_default_constructor
is_default_method
- is_definition()
  All function declarations and definitions have the node type CursorKind.FUNCTION_DECL. To distinguish between the two, this function is used. Returns True if this is a definition and False if it is just a declaration. For example:
  for cur in node.walk_preorder():
      if cur.kind == clang.cindex.CursorKind.FUNCTION_DECL:
          if cur.is_definition():
              # This is a definition - the function has a body
              pass
          else:
              # This is a declaration - just the function prototype
              pass
is_move_constructor
is_mutable_field
is_pure_virtual_method
is_scoped_enum
is_static_method
is_virtual_method
- kind
  This is a property holding the type of node in the AST that this cursor object represents. For example, for a function declaration it will be equal to clang.cindex.CursorKind.FUNCTION_DECL.
  To see all the cursor kinds it is best to refer to cindex.py in the bindings.
  Here are some of the more useful/frequent ones you might use in a C program, for example:
  CursorKind Name                 Meaning
  CursorKind.STRUCT_DECL          A C or C++ struct.
  CursorKind.FIELD_DECL           A field (in C) or non-static data member (in C++) in a struct, union, or C++ class.
  CursorKind.ENUM_DECL            An enumeration.
  CursorKind.ENUM_CONSTANT_DECL   An enumerator constant.
  CursorKind.FUNCTION_DECL        A function.
  CursorKind.PARM_DECL            A function parameter.
  CursorKind.VAR_DECL             A variable.
  CursorKind.TYPEDEF_DECL         A typedef.
lexical_parent
- linkage
  Tells you whether something is visible only within the compilation unit (static) or is visible globally: LinkageKind.INTERNAL or LinkageKind.EXTERNAL.
location
mangled_name
objc_type_encoding
raw_comment
referenced
result_type
semantic_parent
- spelling
  A property giving the name for the entity referenced by this cursor. For example, if this cursor references a function declaration (clang.cindex.CursorKind.FUNCTION_DECL) for the function "MyBestFunction(...)" this property holds the string "MyBestFunction".
storage_class
tls_kind
translation_unit
type
underlying_typedef_type
- walk_preorder()
  Returns a generator that allows you to visit every node in the AST in pre-order (visit the root first, then pre-order visit each child recursively).
  Calling next() on the returned generator yields a clang.cindex.Cursor object.
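The "+-- NODE:" style trees used in these notes can be produced with a small recursive walk over get_children(), kind and spelling. The following is a sketch of one way to do that, not necessarily the exact script used for the dumps here. dump_tree is a made-up name, and the per-node attribute lines (the ": cur.xxx" rows) are left out to keep it short.

```python
def dump_tree(cursor):
    """Return the lines of an indented dump of cursor and all its descendants."""
    lines = ["+-- NODE: CursorKind.%s spel = '%s' (len=%d)"
             % (cursor.kind.name, cursor.spelling, len(cursor.spelling))]
    # get_children() returns an iterator, so take a list to know which is last.
    children = list(cursor.get_children())
    for i, child in enumerate(children):
        # Draw a '|' down the child column while further siblings follow.
        continuation = "|   " if i < len(children) - 1 else "    "
        sub = dump_tree(child)
        lines.append("    " + sub[0])
        # Every line after the first starts with the child's own 4-space
        # child column; swap that for the continuation marker.
        lines.extend("    " + continuation + line[4:] for line in sub[1:])
    return lines
```

With a parsed file, print("\n".join(dump_tree(tu.cursor))) prints the whole tree. Comparing kind.name rather than the kind objects keeps the function usable with simple stand-in objects when trying things out.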
Finding Enums
I wanted to find enums, whether they were anonymous or named, and for both cases if they were hidden behind a typedef. I was only interested in globally defined enums, not enums embedded in structs or local to functions, but I've included some examples here.
- An anonymous enum:
  // 1. Anonymous enum
  enum {
      ANON_ENUM_1,
      ...
  };
  - cursor.spelling = ""
  - cursor.type.spelling = "enum (anonymous at test_files/test1.c:1:1)"
  - cursor.is_anonymous() = True
  - The AST tree representing this is:
    +-- NODE: CursorKind.ENUM_DECL spel = '' (len=0)
        |   : cur.type.spelling: enum (anonymous at test_files/test1.c:1:1)
        |   : cur.type.kind: TypeKind.ENUM
        |   : cur.is_anonymous: True
        |   : cur.lexical_parent.spelling: test_files/test1.c
        |   : cur.semantic_parent.spelling: test_files/test1.c
        |   : cur.enum_type.spelling: int
        +-- NODE: CursorKind.ENUM_CONSTANT_DECL spel = 'ANON_ENUM_1' (len=11)
        |       : cur.type.spelling: int
        |       : cur.enum_value: 0
        |       : cur.semantic_parent.type.spelling: (anonymous at test_files/test1.c:1:1)
        |       : cur.semantic_parent.kind.spelling: CursorKind.ENUM_DECL
        +-- NODE: CursorKind.ENUM_CONSTANT_DECL spel = 'ANON_ENUM_2' (len=11)
        ...
        ...
        ...
- A named enum called Bare_Named_Enum.
  // 2. Named enum
  enum Bare_Named_Enum {
      BARE_NAMED_ENUM_1,
      ...
  };
  - There is only one enum decl.
  - cursor.spelling = "Bare_Named_Enum"
  - cursor.type.spelling = "enum Bare_Named_Enum"
  - cursor.is_anonymous() = False
  - The AST tree representing this:
    +-- NODE: CursorKind.ENUM_DECL spel = 'Bare_Named_Enum' (len=15)
        |   : cur.type.spelling: enum Bare_Named_Enum
        |   : cur.type.kind: TypeKind.ENUM
        |   : cur.is_anonymous: False
        |   : cur.lexical_parent.spelling: test_files/test1.c test_files/test1.c
        |   : cur.enum_type.spelling: int
        +-- NODE: CursorKind.ENUM_CONSTANT_DECL spel = 'BARE_NAMED_ENUM_1' (len=17)
        |       : cur.type.spelling: int
        |       : cur.enum_value: 0
        |       : cur.semantic_parent.type.spelling: enum Bare_Named_Enum
        |       : cur.semantic_parent.kind.spelling: CursorKind.ENUM_DECL
        +-- NODE: CursorKind.ENUM_CONSTANT_DECL spel = 'BARE_NAMED_ENUM_2' (len=17)
        ...
        ...
        ...
- A typedef'ed anonymous enum.
  // 3. Typedef'd anonymous enum
  typedef enum {
      TYPEDEF_ANON_ENUM_1,
      ...
  } Typedef_Anonymouse_Enum_t;
  - cursor.spelling = ""
  - cursor.type.spelling = "Typedef_Anonymouse_Enum_t"
  - cursor.is_anonymous() = False. Presumably because it is referenced by the type created.
  - AST:
    +-- NODE: CursorKind.TYPEDEF_DECL spel = 'Typedef_Anonymouse_Enum_t' (len=25)
        |   : cur.type.spelling: Typedef_Anonymouse_Enum_t
        |   : cur.spelling: Typedef_Anonymouse_Enum_t
        |   : cur.underlying_typedef_type.spelling: enum Typedef_Anonymouse_Enum_t
        +-- NODE: CursorKind.ENUM_DECL spel = '' (len=0)
            |   : cur.type.spelling: Typedef_Anonymouse_Enum_t
            |   : cur.type.kind: TypeKind.ENUM
            |   : cur.is_anonymous(): False
            |   : cur.lexical_parent.type.spelling: test_files/test1.c
            |   : cur.semantic_parent.type.spelling: test_files/test1.c
            |   : cur.enum_type.spelling: int
            +-- NODE: CursorKind.ENUM_CONSTANT_DECL spel = 'TYPEDEF_ANON_ENUM_1' (len=19)
            |       : cur.type.spelling: int
            |       : cur.enum_value: 0
            |       : cur.lexical_parent.type.spelling: enum MySecondTestEnum
            |       : cur.semantic_parent.kind.spelling: CursorKind.ENUM_DECL
            +-- NODE: CursorKind.ENUM_CONSTANT_DECL spel = 'TYPEDEF_NAMED_ENUM_2' (len=20)
            ...
            ...
- A typedef'ed enum with a name.
  // 4. Typedef'd named enum
  typedef enum Typdef_Named_enum {
      TYPEDEF_NAMED_ENUM_1,
      ...
  } Typedef_Named_Enum_t;
  - There are two enum decls - one for the enum alone, and one as a child of the typedef.
  - cursor.spelling = "Typdef_Named_enum"
  - cursor.type.spelling = "enum Typdef_Named_enum"
  - cursor.is_anonymous() = False
  - AST:
    +-- NODE: CursorKind.TYPEDEF_DECL spel = 'Typedef_Named_Enum_t' (len=20)
        |   : cur.type.spelling: Typedef_Named_Enum_t
        |   : cur.spelling: Typedef_Named_Enum_t
        |   : cur.underlying_typedef_type.spelling: enum Typdef_Named_enum
        +-- NODE: CursorKind.ENUM_DECL spel = 'Typdef_Named_enum' (len=17)
            |   : cur.type.spelling: enum Typdef_Named_enum
            |   : cur.type.kind: TypeKind.ENUM
            |   : cur.is_anonymous(): False
            |   : cur.lexical_parent.spelling: test_files/test1.c
            |   : cur.semantic_parent.spelling: test_files/test1.c
            |   : cur.enum_type.spelling: int
            +-- NODE: CursorKind.ENUM_CONSTANT_DECL spel = 'TYPEDEF_NAMED_ENUM_1' (len=20)
            |       : cur.type.spelling: int
            |       : cur.enum_value: 0
            |       : cur.lexical_parent.type.spelling: enum Typdef_Named_enum
            |       : cur.semantic_parent.type.spelling: enum Typdef_Named_enum
            |       : cur.semantic_parent.kind.spelling: CursorKind.ENUM_DECL
            +-- NODE: CursorKind.ENUM_CONSTANT_DECL spel = 'TYPEDEF_NAMED_ENUM_2' (len=20)
            ...
            ...
            ...
- A named enum declared inside a structure.
  struct thestruct {
      enum enum_in_struct {
          ENUM_IN_STRUCT_1,
          ENUM_IN_STRUCT_2
      } val;
  };
  - The AST looks like this:
    +-- NODE: CursorKind.STRUCT_DECL spel = 'thestruct' (len=9)
        +-- NODE: CursorKind.ENUM_DECL spel = 'enum_in_struct' (len=14)
        |   |   : cur.type.spelling: enum enum_in_struct
        |   |   : cur.type.kind: TypeKind.ENUM
        |   |   : cur.is_anonymous(): False
        |   |   : cur.lexical_parent.spelling: thestruct test_files/test1.c
        |   |   : cur.semantic_parent.spelling: thestruct test_files/test1.c
        |   |   : cur.enum_type.spelling: int
        |   +-- NODE: CursorKind.ENUM_CONSTANT_DECL spel = 'ENUM_IN_STRUCT_1' (len=16)
        |   |       : cur.type.spelling: int
        |   |       : cur.enum_value: 0
        |   |       : cur.lexical_parent.type.spelling: enum enum_in_struct
        |   |       : cur.semantic_parent.type.spelling: enum enum_in_struct
        |   |       : cur.semantic_parent.kind: CursorKind.ENUM_DECL
        |   +-- NODE: CursorKind.ENUM_CONSTANT_DECL spel = 'ENUM_IN_STRUCT_2' (len=16)
        |           ...
        +-- NODE: CursorKind.FIELD_DECL spel = 'val' (len=3)
            +-- NODE: CursorKind.ENUM_DECL spel = 'enum_in_struct' (len=14)
                |   : cur.type.spelling: enum enum_in_struct
                |   : cur.type.kind: TypeKind.ENUM
                |   : cur.is_anonymous(): False
                |   : cur.lexical_parent.spelling: thestruct test_files/test1.c
                |   : cur.semantic_parent.spelling: thestruct test_files/test1.c
                |   : cur.enum_type.spelling: int
                +-- NODE: CursorKind.ENUM_CONSTANT_DECL spel = 'ENUM_IN_STRUCT_1' (len=16)
                |       : cur.type.spelling: int
                |       : cur.enum_value: 0
                |       : parents: enum enum_in_struct enum enum_in_struct CursorKind.ENUM_DECL
                +-- NODE: CursorKind.ENUM_CONSTANT_DECL spel = 'ENUM_IN_STRUCT_2' (len=16)
                        : cur.type.spelling: int
                        : cur.enum_value: 1
                        : parents: enum enum_in_struct enum enum_in_struct CursorKind.ENUM_DECL
- A typedef'd enum declared in a function:
    +-- NODE: CursorKind.FUNCTION_DECL spel = 'func' (len=4)
        +-- NODE: CursorKind.COMPOUND_STMT spel = '' (len=0)
            +-- NODE: CursorKind.DECL_STMT spel = '' (len=0)
                +-- NODE: CursorKind.ENUM_DECL spel = 'enum_in_func' (len=12)
                |   +-- NODE: CursorKind.ENUM_CONSTANT_DECL spel = 'E_IN_FUNC_1' (len=11)
                |   +-- NODE: CursorKind.ENUM_CONSTANT_DECL spel = 'E_IN_FUNC_2' (len=11)
                +-- NODE: CursorKind.TYPEDEF_DECL spel = 'Enum_In_Func_t' (len=14)
                    +-- NODE: CursorKind.ENUM_DECL spel = 'enum_in_func' (len=12)
                        +-- NODE: CursorKind.ENUM_CONSTANT_DECL spel = 'E_IN_FUNC_1' (len=11)
                        +-- NODE: CursorKind.ENUM_CONSTANT_DECL spel = 'E_IN_FUNC_2' (len=11)
To get the enums:
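A minimal sketch of one way to collect the global enums, going by the trees above: only the direct children of the translation unit cursor are looked at, so enums inside structs and functions are skipped. An ENUM_DECL found as the child of a TYPEDEF_DECL is merged with the bare one using get_usr(). The name find_global_enums and the shape of the returned dictionaries are made up for these notes, and it assumes an anonymous enum has an empty spelling, as in the dumps above.

```python
def find_global_enums(tu_cursor):
    """Collect the enums declared at translation unit scope.

    Returns a list of dicts with the enum name (None if it has none),
    the typedef names that wrap it, and its (name, value) constants.
    """
    enums = {}  # keyed by USR so the same enum is only recorded once

    def record(enum_cursor):
        usr = enum_cursor.get_usr()
        if usr not in enums:
            enums[usr] = {
                "name": enum_cursor.spelling or None,
                "typedefs": [],
                "constants": [
                    (c.spelling, c.enum_value)
                    for c in enum_cursor.get_children()
                    if c.kind.name == "ENUM_CONSTANT_DECL"
                ],
            }
        return enums[usr]

    for child in tu_cursor.get_children():
        if child.kind.name == "ENUM_DECL":
            record(child)
        elif child.kind.name == "TYPEDEF_DECL":
            for sub in child.get_children():
                if sub.kind.name == "ENUM_DECL":
                    record(sub)["typedefs"].append(child.spelling)
    return list(enums.values())
```

Typical use would be find_global_enums(tu.cursor) on a parsed translation unit.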
Functions
All functions are represented in the AST using CursorKind.FUNCTION_DECL nodes. To differentiate between declarations and definitions, the cursor function is_definition() is used.
To go from the declaration to the definition the cursor function get_definition() can be used.
When a function is called, it is represented in the AST using a CursorKind.CALL_EXPR node.
typedef int NewType_t;
long func_with_params(char a, short b, NewType_t c)
{
    return a * b * c;
}
+-- NODE: CursorKind.FUNCTION_DECL spel = 'func_with_params' (len=16)
    |   : cur.is_definition() True
    |   : cur.linkage: LinkageKind.EXTERNAL
    |   : cur.result_type.spelling: long
    |   : cur.get_arguments().type.spelling: ['char', 'short', 'NewType_t']
    +-- NODE: CursorKind.PARM_DECL spel = 'a' (len=1)
    |       : cur.type.spelling: char
    +-- NODE: CursorKind.PARM_DECL spel = 'b' (len=1)
    |       : cur.type.spelling: short
    +-- NODE: CursorKind.PARM_DECL spel = 'c' (len=1)
    |       : cur.type.spelling: NewType_t
    +-- NODE: CursorKind.COMPOUND_STMT spel = '' (len=0)
        +-- NODE: CursorKind.RETURN_STMT spel = '' (len=0)
            +-- NODE: CursorKind.UNEXPOSED_EXPR spel = '' (len=0)
                +-- NODE: CursorKind.BINARY_OPERATOR spel = '' (len=0)
                    |   : tokens: ['a', '*', 'b', '*', 'c']
                    +-- NODE: CursorKind.BINARY_OPERATOR spel = '' (len=0)
                    |   |   : tokens: ['a', '*', 'b']
                    |   +-- NODE: CursorKind.UNEXPOSED_EXPR spel = 'a' (len=1)
                    |   |   +-- NODE: CursorKind.UNEXPOSED_EXPR spel = 'a' (len=1)
                    |   |       +-- NODE: CursorKind.DECL_REF_EXPR spel = 'a' (len=1)
                    |   |               : type char
                    |   |               : referenced type char
                    |   +-- NODE: CursorKind.UNEXPOSED_EXPR spel = 'b' (len=1)
                    |       +-- NODE: CursorKind.UNEXPOSED_EXPR spel = 'b' (len=1)
                    |           +-- NODE: CursorKind.DECL_REF_EXPR spel = 'b' (len=1)
                    |                   : type short
                    |                   : referenced type short
                    +-- NODE: CursorKind.UNEXPOSED_EXPR spel = 'c' (len=1)
                        +-- NODE: CursorKind.DECL_REF_EXPR spel = 'c' (len=1)
                                : type NewType_t
                                : referenced type NewType_t
void call_func_with_params(void)
{
    long a;
    a = func_with_params('c', 10, 100);
}
+-- NODE: CursorKind.FUNCTION_DECL spel = 'call_func_with_params' (len=21)
    |   : cur.is_definition() True
    |   : cur.get_definition().is_definition() True
    |   : cur.linkage: LinkageKind.EXTERNAL
    |   : cur.result_type.spelling: void
    +-- NODE: CursorKind.COMPOUND_STMT spel = '' (len=0)
        +-- NODE: CursorKind.DECL_STMT spel = '' (len=0)
        |   +-- NODE: CursorKind.VAR_DECL spel = 'a' (len=1)
        +-- NODE: CursorKind.BINARY_OPERATOR spel = '' (len=0)
            |   : tokens: ['a', '=', 'func_with_params', '(', "'c'", ',', '10', ',', '100', ')']
            +-- NODE: CursorKind.DECL_REF_EXPR spel = 'a' (len=1)
            |       : type long
            |       : referenced type long
            +-- NODE: CursorKind.CALL_EXPR spel = 'func_with_params' (len=16)
                |   : cur.type.spelling: long
                |   : cur.get_arguments().type.spelling: ['char', 'short', 'int']
                +-- NODE: CursorKind.UNEXPOSED_EXPR spel = 'func_with_params' (len=16)
                |   +-- NODE: CursorKind.DECL_REF_EXPR spel = 'func_with_params' (len=16)
                |           : type long (char, short, int)
                |           : referenced type long (char, short, int)
                +-- NODE: CursorKind.UNEXPOSED_EXPR spel = '' (len=0)
                |   +-- NODE: CursorKind.CHARACTER_LITERAL spel = '' (len=0)
                +-- NODE: CursorKind.UNEXPOSED_EXPR spel = '' (len=0)
                |   +-- NODE: CursorKind.INTEGER_LITERAL spel = '' (len=0)
                |           : tokens: ['10']
                +-- NODE: CursorKind.INTEGER_LITERAL spel = '' (len=0)
                        : tokens: ['100']
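Putting the function notes together, here is a rough sketch that sorts the FUNCTION_DECL nodes into definitions and declarations with is_definition() and lists the CALL_EXPR nodes by spelling. The names walk and summarize_functions are made up for these notes, and the hand-rolled walk could be swapped for walk_preorder() when working with real cursors.

```python
def walk(cursor):
    """Yield cursor and all of its descendants in pre-order."""
    yield cursor
    for child in cursor.get_children():
        yield from walk(child)


def summarize_functions(root):
    """Return the names of function definitions, declarations and calls under root."""
    summary = {"definitions": [], "declarations": [], "calls": []}
    for cur in walk(root):
        kind = cur.kind.name
        if kind == "FUNCTION_DECL":
            key = "definitions" if cur.is_definition() else "declarations"
            summary[key].append(cur.spelling)
        elif kind == "CALL_EXPR":
            summary["calls"].append(cur.spelling)
    return summary
```

For the two functions dumped above I would expect both names under "definitions" and func_with_params under "calls". I have not checked what spelling a call through a function pointer reports.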
void use_a_function_pointer(void)
{
    long (*ptr)(char a, short b, int c);
    ptr = &func_with_params;
    struct {
        void(*ptr)(char a, short b, int c);
    } s;
    s.ptr = &func_with_params;
    ptr(1, 2, 3);
    s.ptr(11, 12, 13);
}