Category Archives: C/C++

Building OPs in Tensorflow

In order to accelerate the machine learning operations, the implementation of C++ is preferred than python. In tensorflow, we can compile C++ and call the function in python runtime. The way is shown in [1]. First, we need to get the header directory of tensorflow to compile our code.

Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.sysconfig.get_include()

The complication is done like this

TF_INC=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_include())')

g++ -std=c++11 -shared -o -fPIC -I $TF_INC -O2

We use word2vec [2] as another example to show the flow

class SkipgramWord2vecOp : public OpKernel {
  explicit SkipgramWord2vecOp(OpKernelConstruction* ctx)
      : OpKernel(ctx), rng_(&philox_) {
    string filename;
    OP_REQUIRES_OK(ctx, ctx->GetAttr("filename", &filename));
    OP_REQUIRES_OK(ctx, ctx->GetAttr("batch_size", &batch_size_));
    OP_REQUIRES_OK(ctx, ctx->GetAttr("window_size", &window_size_));
    OP_REQUIRES_OK(ctx, ctx->GetAttr("min_count", &min_count_));
    OP_REQUIRES_OK(ctx, ctx->GetAttr("subsample", &subsample_));
    OP_REQUIRES_OK(ctx, Init(ctx->env(), filename));

    mutex_lock l(mu_);
    example_pos_ = corpus_size_;
    label_pos_ = corpus_size_;
    label_limit_ = corpus_size_;
    sentence_index_ = kSentenceSize;
    for (int i = 0; i < kPrecalc; ++i) {
      NextExample(&precalc_examples_[i].input, &precalc_examples_[i].label);

  void Compute(OpKernelContext* ctx) override {
    Tensor words_per_epoch(DT_INT64, TensorShape({}));
    Tensor current_epoch(DT_INT32, TensorShape({}));
    Tensor total_words_processed(DT_INT64, TensorShape({}));
    Tensor examples(DT_INT32, TensorShape({batch_size_}));
    auto Texamples = examples.flat<int32>();
    Tensor labels(DT_INT32, TensorShape({batch_size_}));
    auto Tlabels = labels.flat<int32>();
      mutex_lock l(mu_);
      for (int i = 0; i < batch_size_; ++i) {         Texamples(i) = precalc_examples_[precalc_index_].input;         Tlabels(i) = precalc_examples_[precalc_index_].label;         precalc_index_++;         if (precalc_index_ >= kPrecalc) {
          precalc_index_ = 0;
          for (int j = 0; j < kPrecalc; ++j) {
      words_per_epoch.scalar<int64>()() = corpus_size_;
      current_epoch.scalar<int32>()() = current_epoch_;
      total_words_processed.scalar<int64>()() = total_words_processed_;
    ctx->set_output(0, word_);
    ctx->set_output(1, freq_);
    ctx->set_output(2, words_per_epoch);
    ctx->set_output(3, current_epoch);
    ctx->set_output(4, total_words_processed);
    ctx->set_output(5, examples);
    ctx->set_output(6, labels);

  struct Example {
    int32 input;
    int32 label;

  int32 batch_size_ = 0;
  int32 window_size_ = 5;
  float subsample_ = 1e-3;
  int min_count_ = 5;
  int32 vocab_size_ = 0;
  Tensor word_;
  Tensor freq_;
  int64 corpus_size_ = 0;
  std::vector<int32> corpus_;
  std::vector<Example> precalc_examples_;
  int precalc_index_ = 0;
  std::vector<int32> sentence_;
  int sentence_index_ = 0;

  mutex mu_;
  random::PhiloxRandom philox_ GUARDED_BY(mu_);
  random::SimplePhilox rng_ GUARDED_BY(mu_);
  int32 current_epoch_ GUARDED_BY(mu_) = -1;
  int64 total_words_processed_ GUARDED_BY(mu_) = 0;
  int32 example_pos_ GUARDED_BY(mu_);
  int32 label_pos_ GUARDED_BY(mu_);
  int32 label_limit_ GUARDED_BY(mu_);

  void NextExample(int32* example, int32* label) EXCLUSIVE_LOCKS_REQUIRED(mu_) {

  Status Init(Env* env, const string& filename) {
    return Status::OK();

REGISTER_KERNEL_BUILDER(Name("SkipgramWord2vec").Device(DEVICE_CPU), SkipgramWord2vecOp);

namespace tensorflow {

    .Output("vocab_word: string")
    .Output("vocab_freq: int32")
    .Output("words_per_epoch: int64")
    .Output("current_epoch: int32")
    .Output("total_words_processed: int64")
    .Output("examples: int32")
    .Output("labels: int32")
    .Attr("filename: string")
    .Attr("batch_size: int")
    .Attr("window_size: int = 5")
    .Attr("min_count: int = 5")
    .Attr("subsample: float = 1e-3")

} // end namespace tensorflow

To compile,

TF_INC=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_include())')
g++ -std=c++11 -shared -o -fPIC -I $TF_INC -O2 -D_GLIBCXX_USE_CXX11_ABI=0, the word2vec C++ can be used as follows,

import tensorflow as tf 

word2vec = tf.load_op_library(os.path.join(os.path.dirname(os.path.realpath(__file__)), ''))

(words, counts, words_per_epoch, current_epoch, total_words_processed,
     examples, labels) = word2vec.skipgram_word2vec(filename=opts.train_data,
(opts.vocab_words, opts.vocab_counts,
opts.words_per_epoch) =[words, counts, words_per_epoch])

The question is why we can use “skipgram_word2vec”.
The answer is in the page [1] about naming in the tensorflow wrapper.

For example,

    .Input("string_tensor: string")
    .Output("output: out_type")
    .Attr("out_type: {float, int32} = DT_FLOAT");
Converts each string in the input Tensor to the specified numeric type.

“A note on naming: Inputs, outputs, and attrs generally should be given snake_case names. The one exception is attrs that are used as the type of an input or in the type of an input. Those attrs can be inferred when the op is added to the graph and so don’t appear in the op’s function. For example, this last definition of ZeroOut will generate a Python function that looks like:”

def string_to_number(string_tensor, out_type=None, name=None):
  """Converts each string in the input Tensor to the specified numeric type.

    string_tensor: A `Tensor` of type `string`.
    out_type: An optional `tf.DType` from: `tf.float32, tf.int32`.
      Defaults to `tf.float32`.
    name: A name for the operation (optional).

    A `Tensor` of type `out_type`.

Therefore, we have the following lines for skipgram_word2vec

    .Output("vocab_word: string")

will generate
def skipgram_word2vec(...):


The code is from Thanks to the upstream tensorflow/models. And nealwu‘s help at github

There is also a Chinese blog talking about Tensorflow’s OPs.

[1] Tensorflow Document

sort list via selection sort

struct ListNode {
    int val;
    ListNode *next;
    ListNode(int x) : val(x), next(NULL) {}

class Solution {
    ListNode *insertionSortList(ListNode *head){
      if (head == NULL || head->next == NULL) return head;

      ListNode *cur = head->next;
      head->next = NULL; // set up the termination condition
      while (cur){
          ListNode *nxt = cur->next;
          ListNode * brk = head;

          if (cur->val < brk->val) {
              cur->next = brk;
              head = cur;
          else {
               while (brk && brk->next && brk->next->val <= cur->val){
                         brk = brk->next;
               cur->next = brk->next; // propagate the NULL, a smart move!
               brk->next = cur;
          cur = nxt;
      return head;



[OJ report] binary tree right side view

The solution is at solution @ github .

The idea is to use DFS to do pre-order traversal of the mirror tree. At the same time, use an array to mark current level, so that we can keep from putting wrong element into the result vector.

class Solution {
   void Helper( TreeNode * root, vector<int> & result,
                    int cur, vector<int> & marked ){
     if(root == NULL) return;
    // pre-order of the mirror binary tree
    // (DFS & just by swapping the traversal order)
    if( result.size() == cur ){
      // maintain a marked vector to keep from putting
      // wrong element into the result vector
    Helper(root->right, result , cur+1, marked);
    Helper(root->left , result , cur+1, marked);
  vector<int> rightSideView(TreeNode * root){
    vector<int> result, marked;
    if(root == NULL) return result;
    Helper(root, result, 0, marked);
    return result;

The runtime complexity is O(n), where n is the node number.

The worst-case memory complexity is O(n) = O(n) .

Eigen for exponential matrix

Useful link

(Implementation and theoretical details) Nicholas J. Higham, “The scaling and squaring method for the matrix exponential revisited,”SIAM J. Matrix Anal. Applic., 26:1179–1193, 2005.

g++ a.cpp -o a -I$(EIGENPackage), where EIGENPackage = where your eigen source code

#include <unsupported/Eigen/MatrixFunctions>
#include <iostream>
using namespace Eigen;
int main()
  const double pi = std::acos(-1.0);
  MatrixXd A(3,3);
  A << 0,    -pi/4, 0,
       pi/4, 0,     0,
       0,    0,     0;
  std::cout << "The matrix A is:\n" << A << "\n\n";
  std::cout << "The matrix exponential of A is:\n" << A.exp() << "\n\n";

strcpy traps me

Be aware of strcpy function has requirement for different char *, which is strcpy(char *, const char *). If you do something like strcpy(p,p+1) is dangerous, why not just use p++. Otherwise, some of compiler will have the issue shown at this link .

Implementation mistake causes memory leakage when build Sparse Matrix in mex C/C++ to interact with MATLAB

Let’s first take a look at 6 Lessons From Dropbox – One Million Files Saved Every 15 Minutes, we know, when the product scales, engineers tend to use more computational efficient (C/C++) languages to compensate the side effect of rapid prototyping/development (Python) language.

Similar thing happens in academic prototyping, MATLAB is convenient, and more worthwhile to be used to explore novel idea and save time. However, some important parts are required to implemented in more efficient way to make the idea more solid. Therefore, programming and interacting programming languages is a handy skill set to computer science students.  When we deal with those situation, memory management and transferring is always pain.

In this article, I program in C with MEX to talk to MATLAB. I mistakenly transfer the sparse matrix format with redundant allocations so it blows up the memory (PS: I deal with large scale computing problem).

Do not use this memory Blow-Up version code:

plhs[0] = mxCreateSparse( c_stamp->M, c_stamp->N, c_stamp->NNZ, mxREAL);
cPr = (double *)mxMalloc(sizeof(double) * (c_stamp->NNZ));
cIr = (mwIndex *)mxMalloc(sizeof(mwIndex) * (c_stamp->NNZ));
cJc = (mwIndex *)mxMalloc(sizeof(mwIndex) * (c_stamp->N + 1));

//copy memory
memcpy( cPr, c_stamp->Pr, sizeof(double) * (c_stamp->NNZ));
memcpy( cIr, (mwIndex *)(c_stamp->Ir), sizeof(mwIndex) * (c_stamp->NNZ));
memcpy( cJc, (mwIndex *)(c_stamp->Jc), sizeof(mwIndex) * (c_stamp->N + 1));

mxSetPr(plhs[0], cPr);
mxSetIr(plhs[0], cIr);
mxSetJc(plhs[0], cJc);

Instead, this version is right

plhs[0] = mxCreateSparse( c_stamp->M, c_stamp->N, c_stamp->NNZ, mxREAL);
cPr = (double *) mxGetPr(plhs[0]);
memcpy(cPr, c_stamp->Pr, sizeof(double)*(c_stamp->NNZ));
cIr = (mwIndex *) mxGetIr(plhs[0]);
memcpy( cIr, (mwIndex *)(c_stamp->Ir), sizeof(mwIndex) * (c_stamp->NNZ));
cJc = (mwIndex *) mxGetJc(plhs[0]);
memcpy( cJc, (mwIndex *)(c_stamp->Jc), sizeof(mwIndex) * (c_stamp->N + 1));

where cPr, cIr, cJc are defined before. c_stamp is a sparse matrix I want to wrap from C/C++ to MATLAB via mex.