Comparison of Protein Corpuses

Wedajo Diribi and Kumudha Raimond

Abstract—This paper presents a comparison of two protein corpuses. A protein corpus is a data set of four files used to evaluate the performance of protein compression algorithms. Although past studies have reported compression rates for protein sequences with varying degrees of success, some standard publications contain incorrect claims and confusing results that arise from inappropriate comparison of the data sets. To highlight the differences and similarities between the data sets, the contents of the files in the two protein corpuses are compared with respect to size in bytes and repetitions of amino acids. In addition, the corpuses are compared based on how difficult their files are to compress. The results indicate that the two protein corpuses possess different regularities. Moreover, nine general-purpose compression algorithms outperform the results reported for biological compressors on one of the corpuses and achieve comparable results on the other.

Index Terms—Protein corpus, compression rate, protein compression, biological compressors, general-purpose compressors.

