IBM Introduces CodeNet for Programming Language Translation

IBM has announced Project CodeNet, a large-scale research dataset aimed at helping teach AI to code.

“Computer scientists have been long fascinated by the possibility of computers programming computers,” according to the announcement, but the problem is not easily solved. If programming language translation were easy, for example, legacy languages like COBOL would have been converted to modern alternatives by now. But programming languages have context and complexity that go beyond a straightforward rules-based translation approach.

Project CodeNet, however, “can drive algorithmic innovation to extract this context with sequence-to-sequence models, ... to make a more significant dent in machine understanding of code as opposed to machine processing of code.”

The Project CodeNet dataset currently includes approximately 14 million code samples, each of which is an intended solution to one of 4,000 coding problems. Together, the samples amount to about 500 million lines of code in more than 50 programming languages, ranging from COBOL, Pascal, and FORTRAN to C++, Java, Python, and Go.
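
To give a rough sense of scale, the figures above can be turned into averages. The short Python sketch below is only back-of-the-envelope arithmetic on the rounded numbers quoted in this article; the constant and function names are illustrative and are not part of the dataset or any IBM tooling.

```python
# Approximate, rounded figures quoted in the article (not exact counts).
TOTAL_SAMPLES = 14_000_000    # code samples
TOTAL_PROBLEMS = 4_000        # coding problems
TOTAL_LINES = 500_000_000     # lines of code


def average(total: int, count: int) -> float:
    """Return the mean number of `total` items per `count` items."""
    return total / count


samples_per_problem = average(TOTAL_SAMPLES, TOTAL_PROBLEMS)
lines_per_sample = average(TOTAL_LINES, TOTAL_SAMPLES)

print(f"~{samples_per_problem:,.0f} samples per problem")  # ~3,500
print(f"~{lines_per_sample:.0f} lines per sample")         # ~36
```

On these rounded numbers, each problem has on the order of a few thousand submitted solutions, and a typical sample is a short program of a few dozen lines. The real distribution across problems and languages is likely uneven, so these averages are indicative only.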