非同期関数呼び出し

std:.async は、非同期関数呼び出しのように感じます。内部では std::async はタスクです。 1 つは非常に使いやすいです。

std::async

std::async は作業パッケージとして callable を取得します。この例では、関数、関数オブジェクト、またはラムダ関数です。

// async.cpp

#include <future>
#include <iostream>
#include <string>

std::string helloFunction(const std::string& s){
 return "Hello C++11 from " + s + ".";
}

class HelloFunctionObject{
 public:
 std::string operator()(const std::string& s) const {
 return "Hello C++11 from " + s + ".";
 }
};

int main(){

 std::cout << std::endl;

 // future with function
 auto futureFunction= std::async(helloFunction,"function");

 // future with function object
 HelloFunctionObject helloFunctionObject;
 auto futureFunctionObject= std::async(helloFunctionObject,"function object");

 // future with lambda function
 auto futureLambda= std::async([](const std::string& s ){return "Hello C++11 from " + s + ".";},"lambda function");

 std::cout << futureFunction.get() << "\n" 
 << futureFunctionObject.get() << "\n" 
 << futureLambda.get() << std::endl;

 std::cout << std::endl;

}

プログラムの実行はそれほどエキサイティングではありません。

future は、関数 (23 行目)、関数オブジェクト (27 行目)、およびラムダ関数 (30 行目) を取得します。最後に、各 Future はその値を要求します (32 行目)。

繰り返しますが、もう少しフォーマルです。 23、27、30 行目の std::async 呼び出しは、2 つのエンドポイントの future と promise の間にデータチャネルを作成します。 promise はすぐにその作業パッケージの実行を開始します。ただし、これはデフォルトの動作にすぎません。 get 呼び出しにより、フューチャーはワークパッケージの結果を要求します

Eager または Lazy 評価

Eager または Lazy Evaluation は、式の結果を計算するための 2 つの直交する戦略です。熱心な評価の場合、式はすぐに評価されます。遅延評価の場合、式は必要な場合にのみ評価されます。多くの場合、遅延評価は call-by-need と呼ばれます。遅延評価を使用すると、疑いのある評価がないため、時間を節約し、計算能力を節約できます。式は、数学計算、関数、または std::async 呼び出しにすることができます。

デフォルトでは、std::async はその作業パッケージをすぐに実行しました。 C++ ランタイムは、計算が同じスレッドで行われるか新しいスレッドで行われるかを決定します。フラグ std::launch::async を使用すると、std::async はその作業パッケージを新しいスレッドで実行します。それに対して、フラグ std::launch::deferred は、std::async が同じスレッドで実行されることを表します。この場合、実行は遅延します。つまり、熱心な評価はすぐに開始されますが、future が get 呼び出しで値を要求すると、ポリシー std::launch::deferred を使用した遅延評価が開始されます。

プログラムは異なる動作を示します。

// asyncLazy.cpp

#include <chrono>
#include <future>
#include <iostream>

int main(){

 std::cout << std::endl;

 auto begin= std::chrono::system_clock::now();

 auto asyncLazy=std::async(std::launch::deferred,[]{ return std::chrono::system_clock::now();});

 auto asyncEager=std::async( std::launch::async,[]{ return std::chrono::system_clock::now();});

 std::this_thread::sleep_for(std::chrono::seconds(1));

 auto lazyStart= asyncLazy.get() - begin;
 auto eagerStart= asyncEager.get() - begin;

 auto lazyDuration= std::chrono::duration<double>(lazyStart).count();
 auto eagerDuration= std::chrono::duration<double>(eagerStart).count();

 std::cout << "asyncLazy evaluated after : " << lazyDuration << " seconds." << std::endl;
 std::cout << "asyncEager evaluated after: " << eagerDuration << " seconds." << std::endl;

 std::cout << std::endl;

}

両方の std::async 呼び出し (13 行目と 15 行目) は現在の時点を返します。しかし、最初の呼び出しは怠惰で、2 番目の呼び出しは貪欲です。 17 行目の 1 秒の短いスリープは、それを明確に示しています。 19 行目の asyncLazy.get() の呼び出しにより、短い仮眠の後に結果が利用可能になります。これは、asyncEager には当てはまりません。 asyncEager.get() は、すぐに実行された作業パッケージから結果を取得します。

より大きなコンピューティングジョブ

std::async は、より大きな計算ジョブをより多くの肩にかけるのに非常に便利です。したがって、スカラー積の計算は、4 つの非同期関数呼び出しを使用してプログラムで行われます。

// dotProductAsync.cpp

#include <chrono>
#include <iostream>
#include <future>
#include <random>
#include <vector>
#include <numeric>

static const int NUM= 100000000;

long long getDotProduct(std::vector<int>& v, std::vector<int>& w){

 auto future1= std::async([&]{return std::inner_product(&v[0],&v[v.size()/4],&w[0],0LL);});
 auto future2= std::async([&]{return std::inner_product(&v[v.size()/4],&v[v.size()/2],&w[v.size()/4],0LL);});
 auto future3= std::async([&]{return std::inner_product(&v[v.size()/2],&v[v.size()*3/4],&w[v.size()/2],0LL);});
 auto future4= std::async([&]{return std::inner_product(&v[v.size()*3/4],&v[v.size()],&w[v.size()*3/4],0LL);});

 return future1.get() + future2.get() + future3.get() + future4.get();
}


int main(){

 std::cout << std::endl;

 // get NUM random numbers from 0 .. 100
 std::random_device seed;

 // generator
 std::mt19937 engine(seed());

 // distribution
 std::uniform_int_distribution<int> dist(0,100);

 // fill the vectors
 std::vector<int> v, w;
 v.reserve(NUM);
 w.reserve(NUM);
 for (int i=0; i< NUM; ++i){
 v.push_back(dist(engine));
 w.push_back(dist(engine));
 }

 // measure the execution time
 std::chrono::system_clock::time_point start = std::chrono::system_clock::now();
 std::cout << "getDotProduct(v,w): " << getDotProduct(v,w) << std::endl;
 std::chrono::duration<double> dur = std::chrono::system_clock::now() - start;
 std::cout << "Parallel Execution: "<< dur.count() << std::endl;

 std::cout << std::endl;

}

このプログラムは、random and time ライブラリの機能を使用します。どちらのライブラリも C++11 の一部です。 2 つのベクトル v と w が作成され、行 27 ～ 43 で乱数が入力されます。各ベクトルは (行 40 ～ 43) 1 億の要素を取得します。 41 行目と 42 行目の dist(engine) は乱数を生成し、0 から 100 の範囲で一様に分布しています。スカラー積の現在の計算は関数 getDotProduct で行われます (12 行目から 20 行目)。 std::async は、標準テンプレートライブラリアルゴリズム std::inner_product を内部的に使用します。 return ステートメントは先物の結果を要約します。

私の PC で結果を計算するのに約 0.4 秒かかります。

しかし今、問題はです。プログラムを 1 つのコアで実行した場合、プログラムはどれくらい高速ですか?関数 getDotProduct を少し変更すると、真実がわかります。


long long getDotProduct(std::vector<int>& v,std::vector<int>& w){
 return std::inner_product(v.begin(),v.end(),w.begin(),0LL);
}

プログラムの実行は 4 倍遅くなります。

最適化

しかし、GCC で最大の最適化レベル O3 でプログラムをコンパイルすると、パフォーマンスの違いはほとんどなくなります。並列実行は約 10% 高速です。

次は?

次の投稿では、std::packaged_task を使用して大規模な計算ジョブを並列化する方法を紹介します。 (校正者アレクセイ エリマノフ )

非同期関数呼び出し

std::async

Eager または Lazy 評価

より大きなコンピューティング ジョブ

最適化

次は?

より大きなコンピューティングジョブ