Wednesday, September 14, 2011

Netlink sockets

Recently I was asked to add IPv6 support to an existing IPv4 configuration interface written in C. The existing interface ran on Linux, and it was used to bring up links, configure routes, attach alias addresses to interfaces, create VLAN interfaces and so on. Well, I rolled up my sleeves and began the work. After a few hours of browsing the internet, I noticed that there really was not much information about how to do all this. It seems to me that IPv6 is not yet so widely used. After a while, however, I was directed to netlink sockets.


The first instructions I found about netlink sockets suggested using a wrapper library like libnl. However, after a bunch of unsuccessful attempts I gave up on libnl (my mind was really not compatible with the libnl documentation) and decided to use raw netlink sockets. Basic usage information for netlink sockets was easily available. There is an RFC discussing netlink sockets, as well as man pages. However, these do not seem to cover all the pitfalls or usage details. On my journey to make this IPv6 interface work I fell into more than one trap, and eventually ended up adding prints to the kernel to see where the execution ended up, and where my requests were discarded...

But let's start the actual business.

Linux networking will be discussed in terms of links (interfaces), addresses and routes. For me these words mean something like:

link (interface), for example eth0 - a door to the outer world. Or eth0.2, a virtual interface (VLAN interface), nevertheless seen as a door to the outer world by the system.

address: for example 192.168.1.77/32, or fe80::211:43ff:fe26:2b6c/64, a label telling what is behind the door - and the address of one specific door.

route: for example 0.0.0.0 via dev eth0, or 10.34.143.0 255.255.255.0 gw 192.168.1.55 metric 2
which tells the system to kick a packet out through a specific door if the destination matches a certain pattern.

I will briefly explain routes here. But before I do that, I'll mention that even though netlink sockets are a way to configure and handle all of these (well, if the underlying drivers support it - I guess nowadays they often do), they're also way more. Netlink is a quite generic interface for exchanging information between the kernel and userspace. I know that at least firewalls and the neighbor cache can be managed via the netlink interface on Linux. However, we concentrate on links, addresses and routes here.

Now, let's look at the previous route examples.

Route 1 had the format 0.0.0.0 via dev eth0
destination 0.0.0.0 (the special 'any' IP - it means this is a default route, and all packets for which we do not have better routing information will use this route.)
dev eth0 specifies the interface to which these packets should be directed.

route 2 was 10.34.143.0 255.255.255.0 gw 192.168.1.55 metric 2

destination 10.34.143.0
mask 255.255.255.0

This tells us that if a packet is destined to an address 10.34.143.xx, then this route is selected. The mask tells which bits of the destination address are meaningful, i.e. all bits which are set to one in the mask are compared. The destination/mask pair
192.168.254.0
255.255.254.0
would mean that packets targeted to 192.168.254.xx or 192.168.255.xx would go to this route. Another way to express this is using the format <address>/<amount of meaningful bits>

destination 10.34.143.0
mask 255.255.255.0

could be shown as
10.34.143.0/24 (mask 255.255.255.0 has 24 meaningful bits since each 0-255 value occupies 8 bits. Thus 255.255.255 is 8+8+8 = 24)
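
To make the mask arithmetic concrete, here is a minimal sketch that converts a dotted netmask into the number of meaningful bits. The helper name mask_to_prefixlen is made up for this example; only inet_pton() and ntohl() are real library calls.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>

/* Hypothetical helper: returns the prefix length of a dotted-quad netmask,
   or -1 if the string is not a valid, contiguous mask. */
static int mask_to_prefixlen(const char *mask)
{
    struct in_addr a;
    uint32_t m;
    int bits = 0;

    if (inet_pton(AF_INET, mask, &a) != 1)
        return -1;
    m = ntohl(a.s_addr);          /* work in host byte order */
    while (m & 0x80000000u) {     /* count the leading one bits */
        bits++;
        m <<= 1;
    }
    return m == 0 ? bits : -1;    /* leftover ones mean a mask with holes */
}
```

With this, mask_to_prefixlen("255.255.255.0") gives 24 and mask_to_prefixlen("255.255.254.0") gives 23.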

gw means gateway. It says that the specified destinations are located behind a machine whose address is xx.
I.e. send these packets to the gateway machine, it knows where they should be forwarded. Hence all packets using this route will first be delivered to 192.168.1.55.

Metric 2... The metric is a way for us to choose between overlapping routes. I.e. if we have two routes whose destinations both match, then we use
1. the route with the more accurate mask (i.e. if we specify a route with mask 255.255.255.255 == host route, saying that the destination address is exact, then this route will be preferred over routes with a looser mask).
2. the route with the smaller metric value.
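
The two selection rules above can be sketched as a small lookup function. This is only an illustration of the rules, not how the kernel implements its routing tables; struct route and select_route are hypothetical names, and addresses are kept in host byte order to keep the bit masking simple.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct route {            /* hypothetical, simplified IPv4 route entry */
    uint32_t dst;         /* destination network, host byte order */
    int      prefixlen;   /* number of meaningful bits, 0..32 */
    int      metric;
};

/* A route matches if the meaningful bits of addr equal those of dst. */
static int route_matches(const struct route *r, uint32_t addr)
{
    uint32_t mask = r->prefixlen == 0 ? 0 : ~(uint32_t)0 << (32 - r->prefixlen);
    return (addr & mask) == (r->dst & mask);
}

/* Rule 1: the longest prefix wins. Rule 2: on a tie, the smaller metric wins. */
static const struct route *select_route(const struct route *tbl, size_t n,
                                        uint32_t addr)
{
    const struct route *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        const struct route *r = &tbl[i];
        if (!route_matches(r, addr))
            continue;
        if (best == NULL || r->prefixlen > best->prefixlen ||
            (r->prefixlen == best->prefixlen && r->metric < best->metric))
            best = r;
    }
    return best;          /* NULL if nothing matched */
}
```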

With IPv6 the routing gets a bit more complicated, since some of the subnet information is built into addresses. I won't get into that now, hopefully you know what you're doing :)


So let's see. The netlink socket interface is a typical socket interface, i.e. we send messages to a socket, and receive replies from the socket. We can also register to receive reports about interesting events, and receive messages informing us of the changes. However, I only show the typical send request => receive response sequences.

As mentioned, requests are divided into link, address and route requests (families). Each of these has request types for

creating new < address/interface/route >
deleting old < address/interface/route >
getting information about existing < address/interface/route >

The exact defines (which also need to be filled in the request) are:



RTM_NEWADDR
RTM_DELADDR,
RTM_GETADDR,

RTM_NEWROUTE
RTM_DELROUTE,
RTM_GETROUTE,

RTM_NEWLINK
RTM_DELLINK
RTM_GETLINK,

For links there is also

RTM_SETLINK,

which needs to be used to change an existing link's attributes. NEWLINK requests with these attributes will be discarded (for me, understanding this required adding prints to the kernel...)

The actual request consists of:

1. a message header of type struct msghdr. I assume you're familiar with this standard message header.
2. a netlink message header of type struct nlmsghdr
3. a request family specific header (struct ifaddrmsg / struct ifinfomsg / struct rtmsg)
4. a set of family specific attributes

Netlink message header

struct nlmsghdr
{
    __u32     nlmsg_len; /* Length of message including header */
    __u16     nlmsg_type; /* Message content */
    __u16     nlmsg_flags;    /* Additional flags */
    __u32     nlmsg_seq; /* Sequence number */
    __u32     nlmsg_pid; /* Sending process port ID */
};


carries information necessary for
1. knowing the length of the message.
2. telling the type of the message (family)
3. flags telling how the message should be handled - directly from /usr/include/linux/netlink.h:

/* Flags values */

NLM_F_REQUEST /* It is request message. */
NLM_F_MULTI /* Multipart message, terminated by NLMSG_DONE */
NLM_F_ACK /* Reply with ack, with zero or error code */
NLM_F_ECHO /* Echo this request */

/* Modifiers to GET request */
NLM_F_ROOT /* specify tree root */
NLM_F_MATCH /* return all matching */
NLM_F_ATOMIC /* atomic GET */
NLM_F_DUMP /* (NLM_F_ROOT|NLM_F_MATCH) */

/* Modifiers to NEW request */
NLM_F_REPLACE /* Override existing */
NLM_F_EXCL /* Do not touch, if it exists */
NLM_F_CREATE /* Create, if it does not exist */
NLM_F_APPEND /* Add to end of list */


For example, every request should contain the flag
NLM_F_REQUEST. Note that NLM_F_MATCH is probably not implemented, and you will get all routes/links/addresses that exist - regardless of the attributes / values in the family specific struct. The flags are described more accurately in the man pages, for example.

4 & 5. identifying the request/response pair. Typically you should set pid to something derived from the process/thread id. I use
(pthread_self() << 16 | getpid());
The kernel's own netlink address (nl_pid in struct sockaddr_nl) is 0.

You should probably use a different sequence number for each request - that way you can associate the correct responses with the correct requests. I usually have a global variable, and issue an atomic increment operation on it when getting a new sequence ID.
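
One possible way to get such sequence numbers is C11 atomics. A minimal sketch, assuming a C11 compiler; next_seq is a hypothetical helper name, not part of any netlink API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Global counter shared by all threads; static storage starts at zero. */
static atomic_uint seq_counter;

/* Hypothetical helper: returns a fresh sequence number for nlmsg_seq.
   atomic_fetch_add returns the old value, so the first id handed out is 1. */
static uint32_t next_seq(void)
{
    return (uint32_t)(atomic_fetch_add(&seq_counter, 1u) + 1u);
}
```

Each request would then get its id with something like req->nlmsg_seq = next_seq(); and the reply can be matched by comparing nlmsg_seq.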

I also like to know if my request succeeded. Thus I always use NLM_F_ACK in all requests other than GET requests. This makes the kernel send us a reply with a zero error code if the requested operation succeeded. With GET requests I can expect to get a reply anyway.

(The kernel's ACK message for a successful operation with the NLM_F_ACK flag will be like:

struct nlmsghdr followed by struct nlmsgerr
where nlmsg_type == NLMSG_ERROR and error member of struct nlmsgerr is set to 0.
)
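
Checking for that ACK could look roughly like the sketch below. The function name nl_ack_status is made up for this example; struct nlmsgerr and the NLMSG_* macros come from linux/netlink.h. It assumes nh points to a message that has already passed NLMSG_OK().

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Hypothetical helper.
   Returns 0 for a positive ACK, a positive errno value if the kernel
   reported an error, and -1 if nh is not a complete NLMSG_ERROR message. */
static int nl_ack_status(struct nlmsghdr *nh)
{
    struct nlmsgerr *err;

    if (nh->nlmsg_type != NLMSG_ERROR ||
        nh->nlmsg_len < NLMSG_LENGTH(sizeof(struct nlmsgerr)))
        return -1;
    err = (struct nlmsgerr *)NLMSG_DATA(nh);
    return -err->error;   /* the kernel stores errors as negative errno values */
}
```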

Family specific structs follow after the nlmsghdr. They have the following form:

Addresses:
struct ifaddrmsg

struct ifaddrmsg
{
    __u8        ifa_family;
    __u8        ifa_prefixlen; /* The prefix length        */
    __u8        ifa_flags; /* Flags            */
    __u8        ifa_scope; /* Address scope        */
    __u32     ifa_index; /* Link index         */
};


Family being address family (typically AF_INET / AF_INET6)
prefixlen telling the mask bits. (see explanation of mask in routes above. This is used when we for example add new ip-address to existing interface - so called alias IP address. Then the mask tells what network lies behind the door for outgoing packets!)
flags

/* ifa_flags */
IFA_F_SECONDARY
IFA_F_TEMPORARY

IFA_F_NODAD
IFA_F_OPTIMISTIC
IFA_F_DADFAILED
IFA_F_HOMEADDRESS
IFA_F_DEPRECATED
IFA_F_TENTATIVE
IFA_F_PERMANENT

whose meanings are mostly unknown to me - I haven't really needed any of these.

ifa_scope
I've successfully used 0 as scope for all requests - I have no idea what the scope really is with addresses.

ifa_index is the index number of the interface this address is bound to; see the man pages for
unsigned if_nametoindex(const char *ifname);


links(interfaces)
struct ifinfomsg

struct ifinfomsg
{
    unsigned char ifi_family;
    unsigned char __ifi_pad;
    unsigned short ifi_type;     /* ARPHRD_* */
    int     ifi_index;     /* Link index */
    unsigned    ifi_flags;     /* IFF_* flags */
    unsigned    ifi_change;     /* IFF_* change mask */
};

When I created a new link (in order to make a VLAN interface - that was the only thing I used the NEWLINK request for), I at first banged my head against the wall by creating an RTM_NEWLINK request where the ifinfomsg struct had its fields filled... I always received error responses from the kernel. When I finally compiled my own kernel, with lots of info prints included, I learned that the ifinfomsg struct should not contain many values in a NEWLINK request - or the request would be discarded. So I only filled the ifi_change field with 0xffffffff and left the rest of the fields at zero. Then I added attributes which specified the link to be a VLAN interface. I'll show the attributes for requests later... This allowed me to create a new link.

After the link was created, I used an RTM_SETLINK request to set the link state to IFF_UP. For this step I filled ifi_index with the index of the newly created interface, and set the flags to contain the IFF_UP bit. ifi_change should still be 0xffffffff (at least according to the man pages), since it is reserved for future use. I was lazy. I simply set ifi_flags to IFF_UP. I guess this approach may be hazardous (not sure though). The interface may have some other flags set, and simply setting the flags to IFF_UP without knowing the original state may make us lose some information. I do not know, since I have not studied these flags so thoroughly. I just guess that a better way could be to first read the flags, and then OR (|) IFF_UP into the existing bits. However, I guess doing this so that possible state changes between reading and setting the flags would be noticed is hard. I just decided that there are no flags which I could accidentally zero... It is easier to not know ;)

routes
struct rtmsg

struct rtmsg
{
    unsigned char     rtm_family;
    unsigned char     rtm_dst_len;
    unsigned char     rtm_src_len;
    unsigned char     rtm_tos;

    unsigned char     rtm_table; /* Routing table id */
    unsigned char     rtm_protocol; /* Routing protocol; see below */
    unsigned char     rtm_scope; /* See below */
    unsigned char     rtm_type; /* See below    */

    unsigned        rtm_flags;
};


where types can be


RTN_UNSPEC,
RTN_UNICAST, /* Gateway or direct route */
RTN_LOCAL, /* Accept locally */
RTN_BROADCAST, /* Accept locally as broadcast,
send as broadcast */
RTN_ANYCAST, /* Accept locally as broadcast,
but send as unicast */
RTN_MULTICAST, /* Multicast route */
RTN_BLACKHOLE, /* Drop */
RTN_UNREACHABLE, /* Destination is unreachable */
RTN_PROHIBIT, /* Administratively prohibited */
RTN_THROW, /* Not in this table */
RTN_NAT, /* Translate this address */
RTN_XRESOLVE, /* Use external resolver */

For normal routes you probably want to use RTN_UNICAST. Although knowing the amount of spam flowing through the ethernet wires... Well, I feel the RTN_BLACKHOLE is tempting ;)

defined protocols:

#define RTPROT_UNSPEC 0
#define RTPROT_REDIRECT 1 /* Route installed by ICMP redirects;
not used by current IPv4 */
#define RTPROT_KERNEL 2 /* Route installed by kernel */
#define RTPROT_BOOT 3 /* Route installed during boot */
#define RTPROT_STATIC 4 /* Route installed by administrator */

/* Values of protocol >= RTPROT_STATIC are not interpreted by kernel;
they are just passed from user and back as is.
It will be used by hypothetical multiple routing daemons.
Note that protocol values should be standardized in order to
avoid conflicts.
*/

#define RTPROT_GATED 8 /* Apparently, GateD */
#define RTPROT_RA 9 /* RDISC/ND router advertisements */
#define RTPROT_MRT 10 /* Merit MRT */
#define RTPROT_ZEBRA 11 /* Zebra */
#define RTPROT_BIRD 12 /* BIRD */
#define RTPROT_DNROUTED 13 /* DECnet routing daemon */
#define RTPROT_XORP 14 /* XORP */
#define RTPROT_NTK 15 /* Netsukuku */
#define RTPROT_DHCP 16 /* DHCP client */


I used RTPROT_STATIC, which marks the route as one installed by the administrator.

possible scopes:

RT_SCOPE_UNIVERSE=0,
/* User defined values */
RT_SCOPE_SITE=200,
RT_SCOPE_LINK=253,
RT_SCOPE_HOST=254,
RT_SCOPE_NOWHERE=255

Well, for route which is not meant to stay inside some known system, it is feasible to use RT_SCOPE_UNIVERSE.


/* rtm_flags */

#define RTM_F_NOTIFY 0x100 /* Notify user of route change */
#define RTM_F_CLONED 0x200 /* This route is cloned */
#define RTM_F_EQUALIZE 0x400 /* Multipath equalizer: NI */
#define RTM_F_PREFIX 0x800 /* Prefix addresses */

I used no flags, i.e. set the flags to zero.
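
Putting the choices above together, filling the rtmsg header for an ordinary static route could look like this. A sketch only: fill_static_route_hdr is a hypothetical helper, and RT_TABLE_MAIN (the main routing table, not discussed above) is an assumption about which table the route should go to.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical helper: header for a normal unicast route added by an admin.
   family is AF_INET or AF_INET6, dst_prefixlen the number of meaningful
   bits in the destination that will follow as an attribute. */
static void fill_static_route_hdr(struct rtmsg *rtm, unsigned char family,
                                  unsigned char dst_prefixlen)
{
    memset(rtm, 0, sizeof(*rtm));
    rtm->rtm_family   = family;
    rtm->rtm_dst_len  = dst_prefixlen;
    rtm->rtm_table    = RT_TABLE_MAIN;      /* assumption: main table */
    rtm->rtm_protocol = RTPROT_STATIC;
    rtm->rtm_scope    = RT_SCOPE_UNIVERSE;
    rtm->rtm_type     = RTN_UNICAST;
    rtm->rtm_flags    = 0;                  /* no flags */
}
```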


So at this spot, our request consists of netlink header telling the type of our request (address, link or route) and the type specific structure behind this netlink header. Now, so that it wouldn't be so simple, we'll introduce some more dynamic data :]


attributes

Each request supports a variable number of attributes which further describe the address/route/interface being created (changed/deleted). Attributes are data preceded by a struct rtattr header. It looks like:

struct rtattr
{
    unsigned short rta_len;
    unsigned short rta_type;
};

After this structure we have the actual attribute data, which of course depends on rta_type, and has length
rta_len - sizeof(struct rtattr). Each attribute starts at a 4 byte aligned offset :]

Well, as a C coder can imagine, this level of dynamic structures can be quite challenging to handle. And as can be guessed, there are macros written to ease the attribute handling. (I also wrote simple functions to add attributes to the request.)

Macros to handle all this are:

RTA_ALIGN(len)
RTA_OK(rta,len)
RTA_NEXT(rta,attrlen)
RTA_LENGTH(len)
RTA_DATA(rta)
RTA_PAYLOAD(rta)

RTA_ALIGN(len) rounds the given length up to the next alignment boundary. E.g. with the typical 4 byte alignment

RTA_ALIGN(1) would return 4, as would RTA_ALIGN(4). RTA_ALIGN(5) would return 8 and so on.
RTA_OK(rta,len) can be used to see if the attribute is ok. It is quite common to use RTA_NEXT with RTA_OK to parse incoming messages, i.e. with RTA_NEXT we get the next attribute, and with RTA_OK we check that the attribute is ok to be inspected. The len passed to these macros is originally the length of the attribute buffer. Each call to RTA_NEXT updates the length.
RTA_LENGTH(len) returns the value for rta_len of an attribute whose data is len bytes, i.e. RTA_LENGTH adds the (aligned) length of the rtattr header to the length of the data. The padding needed after the data to get the next attribute correctly aligned is not included; RTA_ALIGN(RTA_LENGTH(len)), also available as RTA_SPACE(len), gives the padded size.
RTA_DATA(rta) returns a pointer to the beginning of the data in attribute rta.
RTA_PAYLOAD(rta) returns the length of the data payload in the attribute.
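
The arithmetic of these macros is easy to check with a few lines of C. The sketch below assumes the usual 4 byte RTA_ALIGNTO and a 4 byte struct rtattr; attr_footprint is a hypothetical helper that computes how many bytes one attribute occupies in a message.

```c
#include <assert.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical helper: bytes taken in the message by one attribute with
   datalen bytes of data, i.e. header + data + padding up to the alignment. */
static unsigned int attr_footprint(unsigned int datalen)
{
    return (unsigned int)RTA_ALIGN(RTA_LENGTH(datalen));
}
```

So an attribute with 5 bytes of data gets rta_len 9 (RTA_LENGTH(5)) but occupies attr_footprint(5) == 12 bytes in the message.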

Sounds confusing? Well, don't be worried. I'll later show you a few functions to handle the attributes...

Now we can get to the confusing point...
With VLAN interface creation I hit my head against the wall. I did some googling. Then I googled more. Finally I had googled my *** off. "VLAN interface netlink sockets", "RTM_NEWLINK VLAN". "Create VLAN via netlink". "VLAN netlink attribute". It basically took all my googling skills, as well as a jump into the kernel sources, to finally find out what kind of magical attribute allows one to specify a VLAN interface. Google gave me some hints that such an attribute exists, and finally I found it...

nested attributes. Nested attributes are attributes, which have attributes inside. That's it. One of the attributes to create a VLAN interface is one of these. But I'll show that later.

Attributes supported by different families are:

Addresses:

IFA_UNSPEC,
IFA_ADDRESS,
IFA_LOCAL,
IFA_LABEL,
IFA_BROADCAST,
IFA_ANYCAST,
IFA_CACHEINFO,
IFA_MULTICAST,


Routes:


RTA_UNSPEC,
RTA_DST,
RTA_SRC,
RTA_IIF,
RTA_OIF,
RTA_GATEWAY,
RTA_PRIORITY,
RTA_PREFSRC,
RTA_METRICS,
RTA_MULTIPATH,
RTA_PROTOINFO, /* no longer used */
RTA_FLOW,
RTA_CACHEINFO,
RTA_SESSION, /* no longer used */
RTA_MP_ALGO, /* no longer used */
RTA_TABLE,


Interfaces (links):


IFLA_UNSPEC,
IFLA_ADDRESS,
IFLA_BROADCAST,
IFLA_IFNAME,
IFLA_MTU,
IFLA_LINK,
IFLA_QDISC,
IFLA_STATS,
IFLA_COST,
IFLA_PRIORITY,
IFLA_MASTER,
IFLA_WIRELESS, /* Wireless Extension event - see wireless.h */
IFLA_PROTINFO, /* Protocol specific information for a link */
IFLA_TXQLEN,
IFLA_MAP,
IFLA_WEIGHT,
IFLA_OPERSTATE,
IFLA_LINKMODE,
IFLA_LINKINFO,
IFLA_NET_NS_PID,


Some of these attributes are documented in rtnetlink man pages at man section 7 - some aren't...



Now just to ease the pain for those who struggle with the VLAN interface setup... I managed to do it with the following attributes:

IFLA_LINK, data size of int, contains the index of the real interface which this VLAN interface uses beneath.
IFLA_IFNAME, name of the new VLAN interface, length depends on the length of the name you give. I used the naming convention <original_interface>.<vlan id>

nested attribute IFLA_LINKINFO, containing:
attribute IFLA_INFO_KIND, data is the string 'vlan', length being the length of the string + padding.
another nested attribute IFLA_INFO_DATA, containing:
attribute IFLA_VLAN_ID, whose data is the vlan id.

Uh oh... Sounds confusing, right? Nested attribute containing an attribute and another nested attribute which contains an attribute... Oh joy, occasionally I tend to believe that programmers have been REALLY drunk when getting their ideas... (http://xkcd.com/323/)

Anyways, I'll show you the code which I used to handle this:


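#define NLMSG_BOTTOM(nlmsg) ((struct rtattr *)(((char *)(nlmsg)) + NLMSG_ALIGN((nlmsg)->nlmsg_len)))


/* Appends one attribute to the end of the request. The caller must make sure that
   the buffer behind nl_req has room for RTA_ALIGN(RTA_LENGTH(datalen)) more bytes. */
static int addAttr(struct nlmsghdr *nl_req, int attrlabel, const void *data, int datalen)
{
    struct rtattr *attr;
    unsigned int attrlen;
    if(NULL==nl_req || datalen < 0 || (datalen > 0 && NULL==data))
    {
        printf("NULL arg detected!");
        return -1;
    }
    attr=NLMSG_BOTTOM(nl_req); /* only safe after the NULL check above */
    attrlen=RTA_LENGTH(datalen); /* sizeof(struct rtattr) + datalen, trailing padding not included */
    attr->rta_type=attrlabel;
    attr->rta_len=attrlen;
    if(datalen > 0)
        memcpy(RTA_DATA(attr),data,datalen);

    nl_req->nlmsg_len=NLMSG_ALIGN(nl_req->nlmsg_len)+RTA_ALIGN(attrlen);
    return 0;
}


/* Starts a nested attribute: an attribute header with no data of its own yet. */
static struct rtattr * addNestedAttr(struct nlmsghdr *nl_req, int attrlabel)
{
    struct rtattr *nested;

    if(NULL==nl_req)
        return NULL;
    nested = NLMSG_BOTTOM(nl_req);
    if(!addAttr(nl_req, attrlabel, NULL, 0))
        return nested;
    return NULL;
}

/* Closes a nested attribute: its length covers everything added since addNestedAttr(). */
static void endNestedAttr(struct nlmsghdr *nl_req, struct rtattr *nested)
{
    nested->rta_len = (char *)NLMSG_BOTTOM(nl_req) - (char *)nested;
}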


/* ...snip - Add attributes to the nlmsg msg */


        struct rtattr *attr1, *attr2;
        if(addAttr(msg,IFLA_LINK,&orig_ifindex,sizeof(int)))
        {
            printf("IFLA_LINK %d adding as rtattr to req failed!",orig_ifindex);
            retval=-1;
        }
        else if(addAttr(msg,IFLA_IFNAME,ifname,strlen(ifname)))
        {
            printf("IFLA_IFNAME %s adding as rtattr to req failed!",ifname);
            retval=-1;
        }
        else if(NULL==(attr1=addNestedAttr(msg,IFLA_LINKINFO)))
        {
            printf("addNestedAttr IFLA_LINKINFO FAILED!");
            retval=-1;
        }
        else if(addAttr(msg,IFLA_INFO_KIND,"vlan", strlen("vlan")))
        {
            printf("IFLA_INFO_KIND \"vlan\" adding FAILED!");
            retval=-1;
        }
        else if(NULL==(attr2=addNestedAttr(msg,IFLA_INFO_DATA)))
        {
            printf("addNestedAttr IFLA_INFO_DATA FAILED!");
            retval=-1;
        }
        else if(addAttr(msg,IFLA_VLAN_ID,&vlanid,sizeof(unsigned short)))
        {
            printf("IFLA_VLAN_ID %hu adding as rtattr to req failed!",vlanid);
            retval=-1;
        }
        else
        {
            endNestedAttr(msg,attr2);
            endNestedAttr(msg,attr1);
            printf("VLAN ID %hu, orig ifindex %d and new ifname %s added as attrs",vlanid,orig_ifindex,ifname);
        }




This code assumes that the length of the nlmsg (in struct nlmsghdr) is kept up to date during message creation, i.e. that when attributes are added, the length in nlmsghdr is updated to be the length of the message constructed this far. Attribute addition relies upon this length when adding new attributes, and updates it when attributes are added. (see the macro NLMSG_BOTTOM() )
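
The same length bookkeeping works for other request types as well. As an illustration, here is a sketch of building an RTM_NEWADDR request that adds an IPv6 address to an interface. The names struct addr_req, put_attr and build_newaddr6_req are hypothetical, put_attr is a bounds-checked variation of the append idea, and the choice of IFA_LOCAL and of the NLM_F_CREATE | NLM_F_EXCL flags is an assumption modelled on what the ip tool sends, so verify it against your kernel.

```c
#include <assert.h>
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical request buffer: both headers plus room for a few attributes. */
struct addr_req {
    struct nlmsghdr  nh;
    struct ifaddrmsg ifa;
    char             attrs[64];
};

/* Appends one attribute at the current end of the message, refusing to
   write past maxlen (the size of the whole request buffer). */
static int put_attr(struct nlmsghdr *nh, size_t maxlen, unsigned short type,
                    const void *data, unsigned short datalen)
{
    size_t off = NLMSG_ALIGN(nh->nlmsg_len);
    struct rtattr *rta;

    if (off + RTA_ALIGN(RTA_LENGTH(datalen)) > maxlen)
        return -1;                          /* attribute would not fit */
    rta = (struct rtattr *)((char *)nh + off);
    rta->rta_type = type;
    rta->rta_len  = RTA_LENGTH(datalen);
    if (datalen > 0)
        memcpy(RTA_DATA(rta), data, datalen);
    nh->nlmsg_len = off + RTA_ALIGN(rta->rta_len);
    return 0;
}

static int build_newaddr6_req(struct addr_req *req, const char *addr,
                              unsigned char prefixlen, unsigned int ifindex,
                              unsigned int seq)
{
    struct in6_addr ip;

    if (inet_pton(AF_INET6, addr, &ip) != 1)
        return -1;                          /* not a valid IPv6 address */
    memset(req, 0, sizeof(*req));
    req->nh.nlmsg_len   = NLMSG_LENGTH(sizeof(struct ifaddrmsg));
    req->nh.nlmsg_type  = RTM_NEWADDR;
    req->nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_CREATE | NLM_F_EXCL;
    req->nh.nlmsg_seq   = seq;
    req->ifa.ifa_family    = AF_INET6;
    req->ifa.ifa_prefixlen = prefixlen;
    req->ifa.ifa_index     = ifindex;       /* e.g. from if_nametoindex() */
    return put_attr(&req->nh, sizeof(*req), IFA_LOCAL, &ip, sizeof(ip));
}
```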




Now I guess I am approaching the end of this short introduction. However, I'll show you something that gives an idea of how messages are sent and received, and how attributes can be parsed. In order to sum up the full horror of this interface (dynamic => flexible and generic => terribly hard to use) I have to mention something about receiving the messages...

Messages will arrive from the socket. They will be placed in the buffer you gave. You need to be prepared to handle:

1. A reply for which you have specified too short a buffer => you'll get the reply with the MSG_TRUNC bit set.
2. A reply where you have multiple nlmsgs in one received buffer. In that case the NLM_F_MULTI flag is set, and the last message has the type NLMSG_DONE.

I.e. you may end up having a buffer where you have a variable number of nlmsgs, each containing a different sized/typed struct and a variable number of possibly nested attributes after that... Some (pseudo)code as an example...

struct sockaddr_nl kernproc;
struct msghdr msg;
struct nlmsghdr *netlinkresp;
struct iovec iov;

memset(&msg,0,sizeof(msg));

memset(&kernproc,0,sizeof(kernproc));
memset(&iov,0,sizeof(iov));

kernproc.nl_family = AF_NETLINK;

msg.msg_name=(void *)&kernproc;
msg.msg_namelen=sizeof(kernproc);

netlinkresp= <buffer allocated for response>;

/* Add NLM_F_ACK to the request if no reply is to be expected otherwise */

iov.iov_base=(void *)netlinkresp;
iov.iov_len= <size of the resp buffer>;

msg.msg_iov=&iov;
msg.msg_iovlen = 1; /* only one iov struct above */


retry:
    retval=recvmsg(sock, &msg, 0);
    if(0 >= retval)
    {
        if(errno==EINTR)
            goto retry;
        else
        {
            printf("Error when receiving from netlink sock!");
            /* handle error */
        }
    }
    else
    {
        /* ...reply received */
    }




So
1. Check the length of the received message from the return value of recvmsg. Never exceed it.
2. Check the msghdr (not nlmsghdr) to see that the message was not truncated.
if(msg.msg_flags&MSG_TRUNC)
... allocate more space for resp and retry...
3. Store a pointer to the nlmsghdr message header.
4. Start a loop and use NLMSG_OK() to see that the message is ok. If the msg is not OK, then you have nothing to handle.
5. Check the type of the nlmsg, and its length. If the length is greater than or equal to the size of the type specific header, then use
NLMSG_DATA to get the actual message. Cast and store a pointer to this.
6. Check the received message for the information you longed for. If the whole nlmsg length is still not handled, then there probably are attributes.
7. Obtain a pointer to the first attribute by adding the size of the message specific struct to NLMSG_DATA().
8. Start a loop and check the attribute with RTA_OK(). If RTA_OK fails, go to step 10.
9. Check the attribute type, and the data according to type & len.
Obtain the next attr with RTA_NEXT -> end loop and go back to step 8.
10. When the last attribute is handled, check if the NLMSG had NLM_F_MULTI set. If it did and NLMSG_DONE has not been seen yet, get the next NLMSG with NLMSG_NEXT() and loop again from step 4.
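
The steps above, minus the recvmsg() call and the truncation check, can be sketched as follows. The function find_ifname is a hypothetical example that looks for the IFLA_IFNAME attribute in RTM_NEWLINK messages; buf and len are assumed to be the receive buffer and the return value of recvmsg().

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical helper: walks all nlmsgs in buf and copies the first
   IFLA_IFNAME found into name. Returns 1 if found, 0 otherwise. */
static int find_ifname(void *buf, int len, char *name, size_t namelen)
{
    struct nlmsghdr *nh;

    if (namelen == 0)
        return 0;
    for (nh = buf; NLMSG_OK(nh, len); nh = NLMSG_NEXT(nh, len)) {
        struct ifinfomsg *ifi;
        struct rtattr *rta;
        int alen;

        if (nh->nlmsg_type == NLMSG_DONE)
            break;                       /* end of a multipart reply */
        if (nh->nlmsg_type != RTM_NEWLINK ||
            nh->nlmsg_len < NLMSG_LENGTH(sizeof(*ifi)))
            continue;                    /* not a complete link message */
        ifi  = NLMSG_DATA(nh);
        /* attributes start after the (aligned) type specific header */
        rta  = (struct rtattr *)((char *)ifi + NLMSG_ALIGN(sizeof(*ifi)));
        alen = (int)NLMSG_PAYLOAD(nh, sizeof(*ifi));
        for (; RTA_OK(rta, alen); rta = RTA_NEXT(rta, alen)) {
            if (rta->rta_type == IFLA_IFNAME) {
                size_t n = RTA_PAYLOAD(rta);
                if (n >= namelen)
                    n = namelen - 1;     /* truncate instead of overflowing */
                memcpy(name, RTA_DATA(rta), n);
                name[n] = '\0';
                return 1;
            }
        }
    }
    return 0;
}
```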



Sending a message is done using the same generic iovec mechanism. E.g.:

    struct sockaddr_nl kernproc;
    struct msghdr msg;
    struct nlmsghdr *netlinkreq;
    struct iovec iov;

    memset(&msg,0,sizeof(msg));
            
    memset(&kernproc,0,sizeof(kernproc));
    memset(&iov,0,sizeof(iov));

    kernproc.nl_family = AF_NETLINK;

    msg.msg_name=(void *)&kernproc;
    msg.msg_namelen=sizeof(kernproc);

    netlinkreq= < pointer to allocated and filled message request > ;

    netlinkreq->nlmsg_pid= (pthread_self() << 16|getpid());
    netlinkreq->nlmsg_seq= atomicallyIncrementSeqId(seqid);
    
    #ifdef debug
    debugprint_msg(netlinkreq);
    #endif

    iov.iov_base=(void *)netlinkreq;
    iov.iov_len=netlinkreq->nlmsg_len;

    msg.msg_iov=&iov;
    msg.msg_iovlen = 1; /* only one iov struct above */

    retval =sendmsg(sock,&msg,0);
    if(retval <=0)
    {
        printf("sendmsg() FAILED!");

    }
    return retval;
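
The sock used above has to be created first. A minimal sketch of opening and binding a netlink socket for link, address and route requests; open_rtnetlink is a hypothetical helper name, and leaving nl_pid at zero so that the kernel assigns the port id is one option among others.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical helper: returns a bound NETLINK_ROUTE socket, or -1 on error
   (errno is left as set by socket() or bind()). */
static int open_rtnetlink(void)
{
    struct sockaddr_nl local;
    int sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

    if (sock < 0)
        return -1;
    memset(&local, 0, sizeof(local));
    local.nl_family = AF_NETLINK;
    local.nl_pid    = 0;   /* 0 lets the kernel pick a unique port id */
    local.nl_groups = 0;   /* no multicast groups: request/response only */
    if (bind(sock, (struct sockaddr *)&local, sizeof(local)) < 0) {
        close(sock);
        return -1;
    }
    return sock;
}
```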




The debugprint function I have used contains the following code:


    if(netlinkreq->nlmsg_flags & NLM_F_ACK)
    {
        printf("Msg contains f_ack!");
    }
    if(!NLMSG_OK(netlinkreq,netlinkreq->nlmsg_len))
    {
        printf("Looks like we're sending invalid nlmsg!! NLMSG_OK() == false at send!");
    }
    else
    {
        printf
        (
            "sending NLMSG: len %u, type %hu, flags %hu, seq %u pid %u",
            netlinkreq->nlmsg_len,
            netlinkreq->nlmsg_type,
            netlinkreq->nlmsg_flags,
            netlinkreq->nlmsg_seq,
            netlinkreq->nlmsg_pid
        );
        switch(netlinkreq->nlmsg_type)
        {
            case RTM_NEWROUTE:
            case RTM_DELROUTE:
            case RTM_GETROUTE:
            {
                printf
                (
                    "Req is route req (new %u, del %u, get %u)",
                    RTM_NEWROUTE,
                    RTM_DELROUTE,
                    RTM_GETROUTE
                );
                printf
                (
                    "family %u, dstlen %u, srclen %u, tos %u, table %u, proto %u, scope %u, type %u, flags %u",
                    (unsigned int)((struct rtmsg *) NLMSG_DATA(netlinkreq) )->rtm_family,
                    (unsigned int)((struct rtmsg *) NLMSG_DATA(netlinkreq) )->rtm_dst_len,
                    (unsigned int)((struct rtmsg *) NLMSG_DATA(netlinkreq) )->rtm_src_len,
                    (unsigned int)((struct rtmsg *) NLMSG_DATA(netlinkreq) )->rtm_tos,
                    (unsigned int)((struct rtmsg *) NLMSG_DATA(netlinkreq) )->rtm_table,
                    (unsigned int)((struct rtmsg *) NLMSG_DATA(netlinkreq) )->rtm_protocol,
                    (unsigned int)((struct rtmsg *) NLMSG_DATA(netlinkreq) )->rtm_scope,
                    (unsigned int)((struct rtmsg *) NLMSG_DATA(netlinkreq) )->rtm_type,
                    (unsigned int)((struct rtmsg *) NLMSG_DATA(netlinkreq) )->rtm_flags
                );
                {
                    /* length of the attribute area only: total length minus nlmsghdr and rtmsg */
                    int len=(int)NLMSG_PAYLOAD(netlinkreq,sizeof(struct rtmsg));
                    struct rtattr *at=(struct rtattr *)((char *)NLMSG_DATA(netlinkreq)+sizeof(struct rtmsg));
                    while(NULL!=at && RTA_OK(at,len))
                    {
                        char tmp[100];
                        switch(at->rta_type)
                        {
                            case RTA_DST:
                                printf
                                (
                                    "dst is set to %s",
                                    inet_ntop
                                    (
                                        (at->rta_len > 8)?AF_INET6:AF_INET,
                                        RTA_DATA(at),
                                        tmp,
                                        100
                                    )
                                );
                                break;
                            case RTA_SRC:
                                 printf
                                (
                                    "src is set to %s",
                                    inet_ntop
                                    (
                                        (at->rta_len > 8)?AF_INET6:AF_INET,
                                        RTA_DATA(at),
                                        tmp,
                                        100
                                    )
                                );
                                break;
                            case RTA_GATEWAY:
                                printf
                                (
                                    "gw is set to %s",
                                    inet_ntop
                                    (
                                        (at->rta_len > 8)?AF_INET6:AF_INET,
                                        RTA_DATA(at),
                                        tmp,
                                        100
                                    )
                                );
                                break;
                            case RTA_OIF:
                                printf
                                (
                                    "OIF is set to %u",
                                    *(unsigned int *)RTA_DATA(at)
                                );
                                break;
                            case RTA_PRIORITY:
                                printf
                                (
                                    "Priority is set to %u",
                                    *(unsigned int *)RTA_DATA(at)
                                );
                                break;
                            default:
                                printf("rta_type %u, len %u",at->rta_type,at->rta_len);
                                break;
                        }
                        at=RTA_NEXT(at,len);
                    }
                }
                break;
            }
            case RTM_NEWADDR:
            case RTM_GETADDR:
            case RTM_DELADDR:
                printf
                (
                    "Req is ADDR req (new %u, del %u, get %u)",
                    RTM_NEWADDR,
                    RTM_DELADDR,
                    RTM_GETADDR
                );
                printf
                (
                    "ifa_family %u, ifa_prefixlen %u, ifa_flags %u, ifa_scope %u, ifa_index %d",
                    (unsigned int)((struct ifaddrmsg *) NLMSG_DATA(netlinkreq) )->ifa_family,
                    (unsigned int)((struct ifaddrmsg *) NLMSG_DATA(netlinkreq) )->ifa_prefixlen,
                    (unsigned int)((struct ifaddrmsg *) NLMSG_DATA(netlinkreq) )->ifa_flags,
                    (unsigned int)((struct ifaddrmsg *) NLMSG_DATA(netlinkreq) )->ifa_scope,
                    (int)((struct ifaddrmsg *) NLMSG_DATA(netlinkreq) )->ifa_index
                );
                {
                    /* length of the attribute area only: total length minus nlmsghdr and ifaddrmsg */
                    int len=(int)NLMSG_PAYLOAD(netlinkreq,sizeof(struct ifaddrmsg));
                    struct rtattr *at=(struct rtattr *)((char *)NLMSG_DATA(netlinkreq)+sizeof(struct ifaddrmsg));
                    while(NULL!=at && RTA_OK(at,len))
                    {
                        char tmp[100];
                        switch(at->rta_type)
                        {
                            case IFA_ADDRESS:
                                printf
                                (
                                    "IFA_ADDRESS is set to %s",
                                    inet_ntop
                                    (
                                        (at->rta_len > 8)?AF_INET6:AF_INET,
                                        RTA_DATA(at),
                                        tmp,
                                        100
                                    )
                                );
                                break;
                            case IFA_LOCAL:
                                printf
                                (
                                    "IFA_LOCAL is set to %s",
                                    inet_ntop
                                    (
                                        (at->rta_len > 8)?AF_INET6:AF_INET,
                                        RTA_DATA(at),
                                        tmp,
                                        100
                                    )
                                );
                                break;
                            case IFA_BROADCAST:
                                printf
                                (
                                    "IFA_BROADCAST is set to %s",
                                    inet_ntop
                                    (
                                        (at->rta_len > 8)?AF_INET6:AF_INET,
                                        RTA_DATA(at),
                                        tmp,
                                        100
                                    )
                                );
                                break;
                            case IFA_LABEL:
                                printf
                                (
                                    "IFA_LABEL is set to '%s'",
                                    (char *)RTA_DATA(at)
                                );
                                break;
                            case IFA_ANYCAST:
                                printf
                                (
                                    "IFA_ANYCAST is set to %s",
                                    inet_ntop
                                    (
                                        (at->rta_len > 8)?AF_INET6:AF_INET,
                                        RTA_DATA(at),
                                        tmp,
                                        100
                                    )
                                );
                                break;
                            default:
                                printf("rta_type %u, len %u",at->rta_type,at->rta_len);
                                break;
                        }
                        at=RTA_NEXT(at,len);
                    }
                }
                break;
.
./* You could add other types of requests here too */
        }
.
.
.


There you see how the contents of a single message can be walked through. One more loop with some extra checks, plus the NLMSG_NEXT and NLMSG_OK macros, would let you go through all the received nlmsgs.
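That outer loop could look roughly like the sketch below. It is only a minimal illustration, not the code I actually used: the names walk_nlmsgs, nlmsg_handler and print_msg_type are made up for this example, and the buffer is assumed to hold whatever one recv() call returned. The handler is where a switch like the one above would go.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical callback type: called once for every ordinary message in the buffer. */
typedef void (*nlmsg_handler)(struct nlmsghdr *msg, void *ctx);

/* Example handler: prints type and length, counts messages through ctx (an int *). */
static void print_msg_type(struct nlmsghdr *msg, void *ctx)
{
    printf("nlmsg_type %u, nlmsg_len %u\n",
           (unsigned int)msg->nlmsg_type, (unsigned int)msg->nlmsg_len);
    (*(int *)ctx)++;
}

/*
 * Walk all netlink messages found in buf (len bytes).
 * Returns the number of messages handed to the handler,
 * or -1 if the kernel reported an error.
 */
static int walk_nlmsgs(void *buf, int len, nlmsg_handler handler, void *ctx)
{
    struct nlmsghdr *msg;
    int count = 0;

    assert(buf != NULL && handler != NULL);

    /* NLMSG_OK checks that a whole message still fits in the remaining len,
       NLMSG_NEXT steps to the next message and decrements len. */
    for (msg = (struct nlmsghdr *)buf; NLMSG_OK(msg, len); msg = NLMSG_NEXT(msg, len))
    {
        if (msg->nlmsg_type == NLMSG_DONE)
            break;                      /* end of a multipart reply */

        if (msg->nlmsg_type == NLMSG_ERROR)
        {
            struct nlmsgerr *err = (struct nlmsgerr *)NLMSG_DATA(msg);

            if (msg->nlmsg_len < NLMSG_LENGTH(sizeof(struct nlmsgerr)) || err->error != 0)
                return -1;              /* truncated or real error */
            continue;                   /* error == 0 is just an ACK */
        }

        handler(msg, ctx);
        count++;
    }
    return count;
}
```

With a received buffer this would be called as walk_nlmsgs(buf, received_bytes, print_msg_type, &counter). Note that a large dump may span several recv() calls, so the caller would keep receiving until NLMSG_DONE has been seen.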





Anyways, I guess I can end my brief introduction to netlink sockets here. Remember to check
man 7 rtnetlink and man 3 netlink for more information. Only some pieces, like nested attributes and VLAN interface setup, are explained here... Maybe you will not need to bang your head against the wall quite as much as I had to. :]


Oh, and if you find this post or the code examples useful, please do leave me a note :) And as always, the code presented here can be used/modified to suit your purposes, as long as you either drop me a note at Mazziesaccount@gmail.com or comment here, and mention the original author (me, Maz) && this site in your code - especially if you publish it somewhere public.

Have fun!

12 comments:

  1. Hello! I read this post and want to ask you to help me with Netlink. This source code does not change the MTU size. Why?
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    int main()
    {
    static char errbuf[256];
    int x;
    scanf("%d",&x);
    struct
    {
    struct nlmsghdr nh;
    struct ifinfomsg ifa;
    char attrbuf[512];
    } req;
    struct rtattr *rta;
    struct sockaddr_nl addr;
    memset(&addr, 0, sizeof(addr));
    addr.nl_family = AF_NETLINK;
    addr.nl_groups = RTMGRP_LINK | RTMGRP_IPV4_IFADDR | RTMGRP_IPV4_ROUTE;
    unsigned int mtu = x;
    int rtnetlink_sk = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    bind(rtnetlink_sk,(struct sockaddr *)&addr, sizeof(addr));
    memset(&req, 0, sizeof(req));
    req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
    req.nh.nlmsg_flags = NLM_F_REPLACE;
    req.nh.nlmsg_type = RTM_NEWLINK;
    req.ifa.ifi_type = 1;
    req.ifa.ifi_family = AF_UNSPEC;
    req.ifa.ifi_index = if_nametoindex("eth1");
    req.ifa.ifi_change = 0xFFFFFFFF;
    rta = (struct rtattr *)(((char *) &req) + NLMSG_ALIGN(req.nh.nlmsg_len));
    rta->rta_type = IFLA_MTU;
    rta->rta_len = RTA_LENGTH(sizeof(unsigned int));
    req.nh.nlmsg_len = NLMSG_ALIGN(req.nh.nlmsg_len) + RTA_LENGTH(sizeof(mtu));
    memcpy(RTA_DATA(rta), &mtu, sizeof(mtu));
    int i = send(rtnetlink_sk, &req, req.nh.nlmsg_len,0);
    printf("%d\n",if_nametoindex("eth0"));
    printf("%d\n",i);
    return 0;
    }

  2. Greetings! My friends and I are writing a system, part of which will manage a network. We decided to use Netlink. There is not much good material about it on the Internet, and there are few experts. I have high hopes that you can help me find the problem in this code, taken from an official example in the man page. It doesn't work and doesn't change the MTU value.

    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    int main()
    {
    static char errbuf[256];
    int x;
    scanf("%d",&x);
    struct
    {
    struct nlmsghdr nh;
    struct ifinfomsg ifa;
    char attrbuf[512];
    } req;
    struct rtattr *rta;
    struct sockaddr_nl addr;
    memset(&addr, 0, sizeof(addr));
    addr.nl_family = AF_NETLINK;
    addr.nl_groups = RTMGRP_LINK | RTMGRP_IPV4_IFADDR | RTMGRP_IPV4_ROUTE;
    unsigned int mtu = x;
    int rtnetlink_sk = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    bind(rtnetlink_sk,(struct sockaddr *)&addr, sizeof(addr));
    memset(&req, 0, sizeof(req));
    req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
    req.nh.nlmsg_flags = NLM_F_REPLACE;
    req.nh.nlmsg_type = RTM_NEWLINK;
    req.ifa.ifi_type = 1;
    req.ifa.ifi_family = AF_UNSPEC;
    req.ifa.ifi_index = if_nametoindex("eth1");
    req.ifa.ifi_change = 0xFFFFFFFF;
    rta = (struct rtattr *)(((char *) &req) + NLMSG_ALIGN(req.nh.nlmsg_len));
    rta->rta_type = IFLA_MTU;
    rta->rta_len = RTA_LENGTH(sizeof(unsigned int));
    req.nh.nlmsg_len = NLMSG_ALIGN(req.nh.nlmsg_len) + RTA_LENGTH(sizeof(mtu));
    memcpy(RTA_DATA(rta), &mtu, sizeof(mtu));
    int i = send(rtnetlink_sk, &req, req.nh.nlmsg_len,0);
    printf("%d\n",if_nametoindex("eth0"));
    printf("%d\n",i);
    //i = recv(rtnetlink_sk, &req, req.nh.nlmsg_len,0);
    //if (i<0) sprintf(errbuf, "Error Receiving Netlink Message: %.128s", strerror(errno));
    // if (i < sizeof(struct nlmsghdr)) sprintf(errbuf, "Received Short Netlink Message: %d", i);
    // if (i < req.nh.nlmsg_len) sprintf(errbuf, "Netlink Message Size Error: %d < %d",i, req.nh.nlmsg_len);
    //if (req.nh.nlmsg_type == NLMSG_ERROR)
    //{
    // struct nlmsgerr *err = (struct nlmsgerr *) NLMSG_DATA(&req.nh);
    // sprintf(errbuf, "Netlink Error: %.128s",strerror(-err->error));
    // }
    // printf("%d\n",i);
    printf("hello world\n");
    return 0;
    }

  3. Do you receive a reply from netlink? If you do, does it carry an error code? Does the kernel log give any hints about what's wrong?

    And let me guess: you are trying to change an existing link, not to create a new one? You might want to use RTM_SETLINK instead of RTM_NEWLINK.

    Experts? I am by no means an expert.
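
For what it is worth, here is a rough sketch of how such an MTU request might be built with RTM_SETLINK. It is an illustration under a few assumptions, not a tested fix for the code above: struct mtu_req and build_set_mtu are names invented for this example. One more thing that may be worth checking in the posted code is the flags field: the kernel only processes messages that have NLM_F_REQUEST set, and the code above sets only NLM_F_REPLACE.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical request layout: header, ifinfomsg and room for one attribute. */
struct mtu_req
{
    struct nlmsghdr nh;
    struct ifinfomsg ifi;
    char attrbuf[64];
};

/*
 * Fill req with an RTM_SETLINK request that sets the MTU of the link
 * with index ifindex. Returns the total message length to pass to send().
 */
static unsigned int build_set_mtu(struct mtu_req *req, int ifindex, unsigned int mtu)
{
    struct rtattr *rta;

    assert(req != NULL);
    memset(req, 0, sizeof(*req));

    req->nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
    req->nh.nlmsg_type = RTM_SETLINK;
    /* NLM_F_REQUEST marks this as a request to the kernel,
       NLM_F_ACK asks for a reply even when everything went fine. */
    req->nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
    req->nh.nlmsg_seq = 1;

    req->ifi.ifi_family = AF_UNSPEC;
    req->ifi.ifi_index = ifindex;
    /* ifi_flags and ifi_change stay zero: only the MTU attribute is changed. */

    /* The attribute starts right after the (aligned) ifinfomsg. */
    rta = (struct rtattr *)((char *)req + NLMSG_ALIGN(req->nh.nlmsg_len));
    rta->rta_type = IFLA_MTU;
    rta->rta_len = RTA_LENGTH(sizeof(mtu));
    memcpy(RTA_DATA(rta), &mtu, sizeof(mtu));

    req->nh.nlmsg_len = NLMSG_ALIGN(req->nh.nlmsg_len) + RTA_LENGTH(sizeof(mtu));
    return req->nh.nlmsg_len;
}
```

The returned length would be used as the size in send(). Because NLM_F_ACK is set, a following recv() should return an NLMSG_ERROR message whose error field is 0 on success or a negative errno value on failure (for example when the process lacks the privileges to change the link).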

  4. Thanks a lot for this article!

    1. No problem! Hopefully it helped although this is quite incomplete..

  5. If I am trying to create a virtual interface in Linux, should I set all the fields of ifinfomsg to 0 (except ifi_change, as you suggested)?

  6. If I remember this correctly: you need two messages. First an RTM_NEWLINK message with an empty ifinfomsg and the correct attributes. Then an RTM_SETLINK message with flags to set the interface state up.

  7. Thanks a lot for this article. I just want to know how to create new VLAN interfaces on the Linux kernel. What is the difference between RTM_NEWLINK and RTM_NEWROUTE usage? If I just use "vconfig add vlan name vlan id" .... How do I code it? Should I use RTM_NEWLINK? Please give me an example...

  8. Good write up Maz, thank you!

    At this point, I can't really help but wonder why the netlink designers did not encapsulate the messages into more coherent structures that are easy to handle: both to send and receive. I mean, besides the beer peak of course :)

  10. Hi Mehmet,

    Actually I can understand the choice. The TLV (type-value-length) representation is quite widely used in the networking world. The main benefit is flexibility. TLV makes skipping unsupported data easy. You can always add new features (types of data) without breaking the old code. If code encounters data with an unknown type, it can simply skip the length of the data and ignore it. So you can build new features without breaking old functionality.

    Also, making a parser for TLV data is not too hard. You can use a common parser for all of the types, and hand the data to a protocol-aware handler only once it is already parsed. The order of attributes does not matter either.
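
As an illustration of such a common parser, here is a minimal sketch for rtattr-style TLVs. The function name parse_attrs and the idea of returning the number of skipped attributes are choices made for this example only. It indexes the attributes by type, so the order in which they arrive does not matter, and silently steps over types it does not know.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/*
 * Generic attribute parser: after the call, table[t] points to the
 * attribute of type t, or is NULL if that attribute was not present.
 * table must have room for max + 1 pointers.
 * Attributes with a type above max are unknown to the caller and are
 * skipped; the return value tells how many were skipped.
 */
static int parse_attrs(struct rtattr *table[], unsigned short max,
                       struct rtattr *rta, int len)
{
    int skipped = 0;

    assert(table != NULL);
    memset(table, 0, sizeof(table[0]) * ((size_t)max + 1));

    /* RTA_OK checks that the attribute fits in the remaining len,
       RTA_NEXT jumps over it using its (aligned) rta_len. */
    for (; RTA_OK(rta, len); rta = RTA_NEXT(rta, len))
    {
        if (rta->rta_type <= max)
            table[rta->rta_type] = rta;
        else
            skipped++;      /* unknown type: the length field lets us step over it */
    }
    return skipped;
}
```

For an address message this could be called with a table of IFA_MAX + 1 pointers, after which a protocol-aware handler simply checks table[IFA_ADDRESS], table[IFA_LOCAL] and so on without caring about the order on the wire.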

    So basically I'd say the idea is good, but the lack of documentation and of generic userspace functions for parsing data is what makes netlink usage hardish. I know we have libnl, but I feel it has even worse documentation. I guess using and understanding libnl (and its documentation) requires that you know the underlying netlink functionality...

    If we look at the kernel part of netlink (or genetlink - maybe I should try writing something more about this) we see how easy and efficient netlink can be. The kernel facilities for parsing and constructing netlink messages make netlink a damn easy and tempting option for kernelspace <-> userspace communication. After all, a bidirectional, reliable, synchronous/asynchronous socket interface supporting multicast... What more could one ask for? There are absolutely no limits to what it can be used for. It is the amount of code needed in userspace that makes this all so frustrating...

    --Maz

  11. Correction - TLV is of course type-length-value, not type-value-length =D
