Monday, March 16, 2020

Comparing Python Data Read/Write Methods and Compression (Part 2): Multiprocessing

While writing the previous post, I kept wondering whether multiprocessing could speed up reading and writing data.

The answer is no. Why?

First, data size. Unless the data set is large enough, the startup cost of the worker processes alone puts multiprocessing far behind.

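Second, the tests do not account for data transfer. Every test writes and reads in place, so uncompressed data is faster to write and also faster to read. Even though the file may be a hundred times larger or more, skipping the processing step gives it the edge.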

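The advantage of multiprocessing is that it brings more CPU power to bear. Compression and decompression are CPU-bound, and the more computation a method needs, the more multiprocessing pays off. lzma is one example.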
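To get a feel for how much more CPU time lzma needs than zlib, the following minimal sketch times both on JSON built from records shaped like the test data in this post. The row count (20000) and the helper name `timed_compress` are assumptions made for a quick run, not the author's benchmark code.

```python
import json
import lzma
import time
import zlib

# Sample records shaped like the test data in this post (fewer rows, for a quick run).
data = [(i, "阿明", "身體狀況", {"體重": 78.9, "肝功能": "正常"}, 90, 85, 97, 81, 85, True)
        for i in range(20000)]
raw = json.dumps(data).encode("ascii")  # json.dumps escapes non-ASCII by default


def timed_compress(compress, payload):
    """Return (compressed_bytes, elapsed_seconds) for one compression call."""
    start = time.perf_counter()
    packed = compress(payload)
    return packed, time.perf_counter() - start


zl_bytes, zl_time = timed_compress(zlib.compress, raw)
xz_bytes, xz_time = timed_compress(lzma.compress, raw)

print(f"raw : {len(raw)} bytes")
print(f"zlib: {len(zl_bytes)} bytes in {zl_time:.3f} s")
print(f"lzma: {len(xz_bytes)} bytes in {xz_time:.3f} s")
```

The timings depend on the machine, so no expected output is given here.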

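In addition, the more of the work is handed to the worker processes, the faster the whole thing runs. For example, do the JSON serialization, the conversion to bytes, the compression, and the decompression inside each process as far as possible, so that the main thread only dispatches data to the processes.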
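The idea of pushing all processing into the workers can be sketched as follows. This is an illustration under assumptions, not the author's code: the function names (`write_chunk`, `read_chunk`, `split`, `parallel_write`) and the `part{i}.zl` file naming are hypothetical, and zlib stands in for whichever compressor is used.

```python
import json
import os
import zlib
from multiprocessing import Pool


def write_chunk(job):
    """Worker: serialize, encode, compress, and write one chunk to its own file."""
    path, chunk = job
    payload = zlib.compress(json.dumps(chunk).encode("ascii"))
    with open(path, "wb") as f:
        f.write(payload)
    return path


def read_chunk(path):
    """Read one chunk file back: decompress, then parse the JSON."""
    with open(path, "rb") as f:
        return json.loads(zlib.decompress(f.read()))


def split(data, n_parts):
    """Split data into n_parts consecutive slices of nearly equal length."""
    size, extra = divmod(len(data), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts


def parallel_write(data, n_procs, directory):
    """Main process only hands out raw chunks; the workers do all the processing."""
    os.makedirs(directory, exist_ok=True)
    jobs = [(os.path.join(directory, f"part{i}.zl"), part)
            for i, part in enumerate(split(data, n_procs))]
    with Pool(n_procs) as pool:
        return pool.map(write_chunk, jobs)
```

Call `parallel_write(data, 4, "out")` from under an `if __name__ == "__main__":` guard, which multiprocessing requires on platforms that spawn new interpreters. Reading back is then `[row for path in paths for row in read_chunk(path)]`.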

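Reading can also be done with multiprocessing (paired with a multiprocess write into separate files), but it usually loses to the main thread simply reading the files one by one. The results read by the processes must ultimately come back to a single main thread for display, so the read ends up on one core again and the multi-core advantage is lost.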

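When writing with multiprocessing, the main thread can move on to the read phase without waiting for the writes to finish. With a large data set, the read may start before the processes have finished writing and raise an error; the fix is to check for this inside the multiprocess read. Because this gives the write a head start, the much shorter write time is an illusion: the time saved simply shows up in the read time.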
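One possible way to do that check is sketched below. It assumes the writer publishes its file under the final name only once it is complete (write to a temporary name, then rename), so the reader can poll for the final name. The function names, the `.tmp` suffix, and the polling interval are all hypothetical; the post does not say how the author's check works.

```python
import os
import time
from multiprocessing import Process


def write_then_publish(path, payload):
    """Write to a temporary name, then rename, so `path` only ever holds a complete file."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(payload)
    os.replace(tmp, path)  # atomic on POSIX when both names are on the same filesystem


def start_write_no_wait(path, payload):
    """Start the write in another process and return immediately (no join)."""
    proc = Process(target=write_then_publish, args=(path, payload))
    proc.start()
    return proc


def read_when_ready(path, timeout=30.0, poll=0.01):
    """Poll until the finished file appears, then read it; the wait counts as read time."""
    deadline = time.monotonic() + timeout
    while not os.path.exists(path):
        if time.monotonic() > deadline:
            raise TimeoutError(f"{path} was not written within {timeout} s")
        time.sleep(poll)
    with open(path, "rb") as f:
        return f.read()
```

With this arrangement the measured write time covers little more than starting the process, and any unfinished writing is paid for inside `read_when_ready`, which matches the shift from write time to read time described above.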

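A name starting with mp means multiprocessing is used. bj, zl, and xz denote the main method (bstr json, json zlib, and json xz). A trailing j or z means the JSON step or the zip step runs inside the worker processes. "n" stands for no wait: the main thread does not wait for the write to finish.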

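Take mpzljzn: it uses mp (multiprocessing) with zl (JSON zlib) as the main method, each process does the JSON serialization and the zip compression, and with no wait the main thread moves on to reading without waiting for the write to finish. It is faster than single-core json zlib and faster than the other multi-core variants that give the processes a different share of the work.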

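With multiprocessing, editing the target function's code while the program is running takes effect in the current run right away, which suggests that the processes load the target function from the actual file. I saw this directly: when I added a sleep to the target function during a run, the numbers responded immediately.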

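Multiprocessing also has its limits, and it is not that straightforward to use. If there is no size limit and no need to transfer the data, bstr json is the best method.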

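bstr json converts the data with bytes(json.dumps(data), encoding = "ascii"), writes it to f opened with 'wb', then opens f with 'rb' and reads it back with json.loads(f.read()). The difference from str json is the mode used to write f: str json uses 'w', while bstr json uses 'wb'.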
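As a minimal sketch of the two variants described above (the function names are hypothetical; only the `bytes(json.dumps(...), encoding="ascii")` call and the file modes come from the description):

```python
import json


def write_bstr_json(path, data):
    """bstr json: encode the JSON text to bytes and write in binary mode."""
    with open(path, "wb") as f:
        f.write(bytes(json.dumps(data), encoding="ascii"))


def write_str_json(path, data):
    """str json: write the JSON text directly in text mode."""
    with open(path, "w") as f:
        f.write(json.dumps(data))


def read_bstr_json(path):
    """Read the bytes back and parse them; json.loads accepts bytes since Python 3.6."""
    with open(path, "rb") as f:
        return json.loads(f.read())
```

The "ascii" encoding works even for the Chinese strings in the test data because `json.dumps` escapes non-ASCII characters by default. Note that tuples come back as lists after the JSON round trip.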

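If the data has to be transferred, mpzljzn is an excellent choice. At just 1 MB per second, the 138.56 MB file produced by bstr json would take more than two minutes to send, against about three seconds for the 3.04 MB file from mpzljzn. The mpxzz family only beats mpzljzn at pitifully low transfer speeds (< 1 MB/s). For the sake of convenience, using zl without multiple cores is also a fine choice.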
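The trade-off between processing time and transfer time can be worked out from the statistics in this post. The sketch below assumes a simple model in which the total cost is avgUtime plus file size divided by bandwidth; the model and the function names are assumptions, and the figures are copied from the results.

```python
def total_seconds(use_time_s, file_mb, bandwidth_mb_s):
    """Write + read time plus the time to send the file over the link."""
    return use_time_s + file_mb / bandwidth_mb_s


def break_even_bandwidth(slow_small, fast_big):
    """Bandwidth (MB/s) at which two methods tie; each argument is (use_time_s, file_mb)."""
    (t_small, mb_small), (t_big, mb_big) = slow_small, fast_big
    return (mb_big - mb_small) / (t_small - t_big)


# (avgUtime in seconds, file size in MB), copied from the statistics in this post.
bstr_json = (5.61739, 138.56)
mpzljzn = (4.81140, 3.04)
mpxzzjn = (8.99828, 0.12)

for name, (use_time, size) in [("bstr json", bstr_json),
                               ("mpzljzn", mpzljzn),
                               ("mpxzzjn", mpxzzjn)]:
    print(f"{name:9s} at 1 MB/s: {total_seconds(use_time, size, 1.0):7.2f} s")

print(f"mpxzzjn beats mpzljzn below {break_even_bandwidth(mpxzzjn, mpzljzn):.2f} MB/s")
```

Under this model the xz variant only wins once the link is slower than the break-even value, which is consistent with the "< 1 MB/s" observation above.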

""" Write and read of 1,000,000 records
data = [(i,"阿明", "身體狀況", {"體重":78.9, "肝功能": "正常"}, 90, 85, 97, 81,  85, True) for i in range(1000000)]
[Statistics]
Method: bstr json, Compress Ratio: 0.06:1
Average Use Time: avgWtime:2.46935, avgRtime:3.14803, avgUtime:5.61739
Data size: 8697464(8.49MB), Wspeed:3.43961 MB/s, Rspeed:2.69807 MB/s, Uspeed:1.51202 MB/s
File size: 141888890(138.56MB), Wspeed:56.11319 MB/s, Rspeed:44.01585 MB/s, Uspeed:24.66687 MB/s
Method: json zlib, Compress Ratio: 2.79:1
Average Use Time: avgWtime:2.35145, avgRtime:3.29203, avgUtime:5.64348
Data size: 8697464(8.49MB), Wspeed:3.61208 MB/s, Rspeed:2.58005 MB/s, Uspeed:1.50503 MB/s
File size: 3117715(3.04MB), Wspeed:1.29480 MB/s, Rspeed:0.92485 MB/s, Uspeed:0.53950 MB/s
Method: json xz, Compress Ratio: 71.11:1
Average Use Time: avgWtime:14.73717, avgRtime:3.62832, avgUtime:18.36548
Data size: 8697464(8.49MB), Wspeed:0.57634 MB/s, Rspeed:2.34093 MB/s, Uspeed:0.46248 MB/s
File size: 122316(0.12MB), Wspeed:0.00811 MB/s, Rspeed:0.03292 MB/s, Uspeed:0.00650 MB/s
Method: mpbj, Compress Ratio: 0.06:1
Average Use Time: avgWtime:2.60734, avgRtime:3.34684, avgUtime:5.95418
Data size: 8697464(8.49MB), Wspeed:3.25758 MB/s, Rspeed:2.53781 MB/s, Uspeed:1.42650 MB/s
File size: 141888890(138.56MB), Wspeed:53.14356 MB/s, Rspeed:41.40131 MB/s, Uspeed:23.27163 MB/s
Method: mpbjn, Compress Ratio: 0.06:1
Average Use Time: avgWtime:2.19224, avgRtime:3.98733, avgUtime:6.17957
Data size: 8697464(8.49MB), Wspeed:3.87440 MB/s, Rspeed:2.13015 MB/s, Uspeed:1.37447 MB/s
File size: 141888890(138.56MB), Wspeed:63.20630 MB/s, Rspeed:34.75091 MB/s, Uspeed:22.42282 MB/s
Method: mpbjj, Compress Ratio: 0.06:1
Average Use Time: avgWtime:2.05410, avgRtime:3.13490, avgUtime:5.18900
Data size: 8697464(8.49MB), Wspeed:4.13496 MB/s, Rspeed:2.70937 MB/s, Uspeed:1.63685 MB/s
File size: 141888890(138.56MB), Wspeed:67.45707 MB/s, Rspeed:44.20021 MB/s, Uspeed:26.70329 MB/s
Method: mpbjjn, Compress Ratio: 0.06:1
Average Use Time: avgWtime:0.06299, avgRtime:4.78647, avgUtime:4.84947
Data size: 8697464(8.49MB), Wspeed:134.83090 MB/s, Rspeed:1.77450 MB/s, Uspeed:1.75145 MB/s
File size: 141888890(138.56MB), Wspeed:2199.60744 MB/s, Rspeed:28.94895 MB/s, Uspeed:28.57290 MB/s
Method: mpzl, Compress Ratio: 2.95:1
Average Use Time: avgWtime:2.93715, avgRtime:3.43182, avgUtime:6.36897
Data size: 8697464(8.49MB), Wspeed:2.89179 MB/s, Rspeed:2.47496 MB/s, Uspeed:1.33359 MB/s
File size: 2950294(2.88MB), Wspeed:0.98093 MB/s, Rspeed:0.83954 MB/s, Uspeed:0.45237 MB/s
Method: mpzlz, Compress Ratio: 2.79:1
Average Use Time: avgWtime:2.57715, avgRtime:3.46488, avgUtime:6.04203
Data size: 8697464(8.49MB), Wspeed:3.29574 MB/s, Rspeed:2.45134 MB/s, Uspeed:1.40575 MB/s
File size: 3118214(3.05MB), Wspeed:1.18159 MB/s, Rspeed:0.87886 MB/s, Uspeed:0.50399 MB/s
Method: mpzlzn, Compress Ratio: 2.79:1
Average Use Time: avgWtime:2.19316, avgRtime:3.75349, avgUtime:5.94665
Data size: 8697464(8.49MB), Wspeed:3.87278 MB/s, Rspeed:2.26286 MB/s, Uspeed:1.42830 MB/s
File size: 3118214(3.05MB), Wspeed:1.38847 MB/s, Rspeed:0.81128 MB/s, Uspeed:0.51207 MB/s
Method: mpzljz, Compress Ratio: 2.79:1
Average Use Time: avgWtime:2.07069, avgRtime:3.32371, avgUtime:5.39440
Data size: 8697464(8.49MB), Wspeed:4.10182 MB/s, Rspeed:2.55547 MB/s, Uspeed:1.57452 MB/s
File size: 3117371(3.04MB), Wspeed:1.47019 MB/s, Rspeed:0.91594 MB/s, Uspeed:0.56435 MB/s
Method: mpzljzn, Compress Ratio: 2.79:1
Average Use Time: avgWtime:0.04629, avgRtime:4.76512, avgUtime:4.81140
Data size: 8697464(8.49MB), Wspeed:183.49081 MB/s, Rspeed:1.78246 MB/s, Uspeed:1.76531 MB/s
File size: 3117371(3.04MB), Wspeed:65.76732 MB/s, Rspeed:0.63887 MB/s, Uspeed:0.63273 MB/s
Method: mpxz, Compress Ratio: 71.11:1
Average Use Time: avgWtime:14.85270, avgRtime:4.02111, avgUtime:18.87381
Data size: 8697464(8.49MB), Wspeed:0.57186 MB/s, Rspeed:2.11226 MB/s, Uspeed:0.45002 MB/s
File size: 122316(0.12MB), Wspeed:0.00804 MB/s, Rspeed:0.02971 MB/s, Uspeed:0.00633 MB/s
Method: mpxzz, Compress Ratio: 69.15:1
Average Use Time: avgWtime:7.99850, avgRtime:2.42350, avgUtime:10.42200
Data size: 8697464(8.49MB), Wspeed:1.06190 MB/s, Rspeed:3.50469 MB/s, Uspeed:0.81497 MB/s
File size: 125776(0.12MB), Wspeed:0.01536 MB/s, Rspeed:0.05068 MB/s, Uspeed:0.01179 MB/s
Method: mpxzzj, Compress Ratio: 71.62:1
Average Use Time: avgWtime:6.89620, avgRtime:3.59600, avgUtime:10.49220
Data size: 8697464(8.49MB), Wspeed:1.23164 MB/s, Rspeed:2.36197 MB/s, Uspeed:0.80952 MB/s
File size: 121444(0.12MB), Wspeed:0.01720 MB/s, Rspeed:0.03298 MB/s, Uspeed:0.01130 MB/s
Method: mpxzzjn, Compress Ratio: 71.62:1
Average Use Time: avgWtime:0.04867, avgRtime:8.94961, avgUtime:8.99828
Data size: 8697464(8.49MB), Wspeed:174.53017 MB/s, Rspeed:0.94905 MB/s, Uspeed:0.94392 MB/s
File size: 121444(0.12MB), Wspeed:2.43699 MB/s, Rspeed:0.01325 MB/s, Uspeed:0.01318 MB/s
"""
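The columns in the statistics appear to be related as in the sketch below. These relations are inferred from the printed numbers, not taken from the author's code: in particular, the "MB" figures match bytes / 1024 / 1000, and the compress ratio matches data size divided by file size. The function names are hypothetical.

```python
def to_mb(n_bytes):
    """Size in the 'MB' used by the statistics: bytes / 1024 / 1000 (inferred)."""
    return n_bytes / 1024 / 1000


def compress_ratio(data_bytes, file_bytes):
    """Data size divided by file size; below 1 means the file exceeds the data size."""
    return data_bytes / file_bytes


def speed_mb_s(n_bytes, seconds):
    """Throughput in MB/s for a given size and elapsed time."""
    return to_mb(n_bytes) / seconds


# json zlib row: data size 8697464 bytes, file size 3117715 bytes, avgWtime 2.35145 s.
print(f"{compress_ratio(8697464, 3117715):.2f}:1")
print(f"Wspeed (data): {speed_mb_s(8697464, 2.35145):.5f} MB/s")
print(f"Wspeed (file): {speed_mb_s(3117715, 2.35145):.5f} MB/s")
```

Read this way, the 0.06:1 ratio for bstr json just means the reported data size is far smaller than the JSON file written to disk, and Uspeed is the same size divided by avgUtime, the sum of the write and read times.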