Introduction

Continuing from the previous article, you learned about conditional probability and Bayes' Rule, saw how to implement them in machine learning using C#, and took the first steps towards a malware classifier for the Android platform. In this article, we fine-tune the data set and try to work out the best ways to improve detection rates while reducing false positives. We also take a good look at spam mail classification and follow a similar process for classifying spam mails.

Android Classification

So far, we have the following options from the malware and benign sample sets:
- A permissions histogram, with ranks and probabilities, built from the 1260 samples of the Malware Genome Project paper.
- Permissions count and file size, taken into account as comparison parameters.
- Folder structure depth in both sets, which can be a good factor but has to be boiled down to a single number.
- The comparison between the manifest permissions and the permissions actually used, which can give us an excellent idea of the kinds of activities the files perform (see the sketch after this list). In addition, the Stowaway research data set gives us a pretty comprehensive listing of APIs and the permissions required for their execution; it can simply be consulted to build a profile in addition to the score ratings from the Bayesian component.
- String-based analysis. Strings are an important data pool in any malware analysis session, so a separate set of string-based analysis algorithms can be incorporated, covering things like non-standard character sequences, URL strings, high-entropy string sequences, base64 strings, and encryption keys.
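To make the declared-versus-used comparison concrete, here is a minimal sketch that checks an app's manifest permissions against the permissions implied by its API calls, using a Stowaway-style API-to-permission map. The class name, method names, and map format are illustrative assumptions, not the actual tool code:

[csharp]
using System;
using System.Collections.Generic;
using System.Linq;

class PermissionGapAnalyzer
{
    // Hypothetical map built from the Stowaway data set: API call -> required permission,
    // e.g. "android.telephony.SmsManager.sendTextMessage" -> "android.permission.SEND_SMS"
    private readonly Dictionary<string, string> apiPermissionMap;

    public PermissionGapAnalyzer(Dictionary<string, string> apiPermissionMap)
    {
        this.apiPermissionMap = apiPermissionMap;
    }

    // Compares permissions declared in the manifest against permissions
    // actually required by the API calls found in the disassembly.
    public void Compare(HashSet<string> manifestPermissions, IEnumerable<string> apiCalls)
    {
        var usedPermissions = new HashSet<string>(
            apiCalls.Where(api => apiPermissionMap.ContainsKey(api))
                    .Select(api => apiPermissionMap[api]));

        // Declared but never exercised by any API call: possible over-permissioning.
        var unused = manifestPermissions.Except(usedPermissions);
        // Exercised but never declared: unusual, worth flagging in the profile.
        var undeclared = usedPermissions.Except(manifestPermissions);

        Console.WriteLine("Declared but unused: " + string.Join(", ", unused));
        Console.WriteLine("Used but undeclared: " + string.Join(", ", undeclared));
    }
}
[/csharp]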
Opcode-level analysis can also be done using Bayes' rule, making use of the most probable opcode sequences from a corpus of disassembly. This entails extracting opcode sequences from the instruction arrays referenced by each method, finding the statistical significance of the frequency of their occurrence, and boiling those frequencies down to probability values for each opcode sequence, which then have to be curated further to maximize efficacy.
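As a rough illustration of that step, the following sketch counts opcode n-gram sequences over per-method opcode arrays and converts the counts to relative frequencies. The names and the n-gram representation are assumptions for the example, not the tool's implementation:

[csharp]
using System;
using System.Collections.Generic;
using System.Linq;

static class OpcodeStats
{
    // Counts n-gram opcode sequences over a corpus of per-method opcode arrays
    // and converts the counts into probabilities (relative frequencies).
    public static Dictionary<string, double> SequenceProbabilities(
        IEnumerable<string[]> methods, int n)
    {
        var counts = new Dictionary<string, long>();
        long total = 0;

        foreach (var opcodes in methods)
        {
            for (int i = 0; i + n <= opcodes.Length; i++)
            {
                string gram = string.Join(" ", opcodes.Skip(i).Take(n));
                counts[gram] = counts.TryGetValue(gram, out long c) ? c + 1 : 1;
                total++;
            }
        }

        // Each sequence's probability is its count divided by the corpus total.
        return counts.ToDictionary(kv => kv.Key, kv => kv.Value / (double)total);
    }
}
[/csharp]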
As of now, we will focus on getting maximum leverage from the permissions and file attributes. Is the Bayes rule an exact science? The answer would be no. Its performance depends entirely on the available data set, from which the best parameters have to be selected. Any anomalies or outliers have to be carefully taken into account so that the result is not skewed. Conversely, this might also involve generalizing or ignoring certain possibilities so that the model achieves good enough prediction rates without having to deal with every exception. The Bayes rule ultimately relies on the statistical properties of a data set, and on the accuracy of that data set to begin with. Machine learning takes from both statistics and data mining, which are of course related fields, so overlapping approaches are quite the norm: it is largely based on observing collected data and inferring a model from the statistical properties of the data set. Most definitely not rocket science for our purposes.

Back to the premise. Using the existing data sets, a C# class is written that encapsulates the probability of each permission in the top-20 list occurring. These values are easy to find: the permission set of each malware sample is saved as a vector, a simple frequency count is run over the vector list for each permission, and the top 20 permissions are listed along with their counts. Dividing each count by the total size of the corpus gives the probability value for that permission.

A vector here is simply a row in which each parameter (permission) is represented in a separate column; each row thus represents the total permissions of one particular file. This representation is helpful for data inference and other statistical operations, and it lends itself to matrix math, so things like dimensionality reduction can be done with finesse using matrices. Dimensionality reduction collapses a number of columns in our current vector format so as to minimize the number of dimensions: two or more columns are combined into a single value, which is very helpful when trying to visualise the data. We will also add additional values obtained from the file itself to embellish the vector set.
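A minimal sketch of this frequency-to-probability computation, assuming each sample's permission set has already been extracted into a vector (the class and method names are illustrative, not the tool's actual code):

[csharp]
using System;
using System.Collections.Generic;
using System.Linq;

static class PermissionStats
{
    // Each vector holds the permissions requested by one malware sample.
    // Returns the 20 most frequent permissions with their probabilities,
    // i.e. frequency count divided by the number of samples in the corpus.
    public static Dictionary<string, double> TopTwentyProbabilities(
        List<HashSet<string>> permissionVectors)
    {
        int corpusSize = permissionVectors.Count;

        return permissionVectors
            .SelectMany(v => v)                 // flatten all permission rows
            .GroupBy(p => p)                    // frequency count per permission
            .OrderByDescending(g => g.Count())
            .Take(20)                           // keep only the top 20
            .ToDictionary(g => g.Key, g => g.Count() / (double)corpusSize);
    }
}
[/csharp]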
A problem with the current parameter set is the inclusion or exclusion of certain permissions that don't fall in the permissions list. Ideally, a perfectly accurate data set would require the list of all available permissions; however, such data is not reliably available for many of the Android APIs and is itself a current research goal. In such cases, the Bayes rule calculation has to incorporate probability values for new values that do not exist in the data set. In our case, the probability values of such terms are statistically very small, or else their values could have been computed from the data set itself. There are quite a few ways to deal with them.

One method is to set a single probability value for all such new values in a general sense; that is, we don't know what the actual occurrence rate is and are just taking a best guess. A very small value between zero and one has to be set: multiplying the probability values by a zero would make the entire final score zero, so the best we can do with this approach is to set something minuscule, such as 0.00001, ensuring a non-zero value. The data set analysis then has to be re-run to see the effect on the final scores. In our case, this method does no good, as the products drift towards denormal numbers (which denote very small floating-point values) when such constants are small enough, and the permissions list does not account for every single permission, especially the non-top-20 ones. Any such outlier will drastically reduce an existing score and register as zero if the denormal numbers are not taken care of separately.

The other method is to simply ignore them. This indeed makes sense for our purposes, because we also have other corroborating data and are not relying exclusively on manifest data to make our final decision.
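The following sketch illustrates the floor-value approach, accumulating the score in log space to sidestep the underflow problem just described. Log-space accumulation is a standard trick for products of small probabilities, not necessarily what the actual classifier does:

[csharp]
using System;
using System.Collections.Generic;

static class BayesScore
{
    // Best-guess floor for permissions not present in the training data set.
    const double UnseenProbability = 0.00001;

    // Multiplying many small probabilities underflows toward denormal numbers,
    // so the product is accumulated as a sum of logarithms instead.
    public static double LogScore(IEnumerable<string> permissions,
                                  Dictionary<string, double> probabilityTable,
                                  double prior)
    {
        double logScore = Math.Log(prior);
        foreach (string perm in permissions)
        {
            double p = probabilityTable.TryGetValue(perm, out double known)
                ? known
                : UnseenProbability;   // never multiply by zero
            logScore += Math.Log(p);
        }
        // Compare log scores directly, or Math.Exp() the result if needed.
        return logScore;
    }
}
[/csharp]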
I currently have a training set of 1824 malware samples, and the vectors obtained for each sample give a value range like the one shown below (an excerpt taken from 912 malware samples from the training set). This is without affecting the prior probability values of the malware and benign scores; that is, the prior probability for both scores is taken as 1. Furthermore, each column is sorted individually, so as to give a better statistical view of the vector set. This is an example of supervised machine learning, since we have the expected outputs of the training set (we know the parameters of each sample and whether it is malware or benign). In unsupervised learning, there are no expected outputs for the training set, and the algorithm is expected to find a pattern on its own to give us a resulting output.

[Malware training data set excerpt – columns: Perm Count, File Size, Malware Bayes Score, Benign Bayes Score]

Benign training data set excerpt –

[Benign training data set excerpt – columns: Perm Count, File Size, Malware Bayes Score, Benign Bayes Score]

Getting the fundamental statistical properties of each column, I get:

[Summary statistics table]

The concept of quartiles helps us understand the data value below which a certain percentage of the sample set lies. A quartile is a median value taken at every 25% of the data set. Taking the regular median of any data set gives the 50% quartile, or second quartile: the value below which 50% of the data set lies. Now take the half of the data set below the median and recalculate the median of that half; this gives the 25% quartile, or first quartile. Repeat for the upper half, above the 50% median of the original, complete data set, and you get the third quartile, or 75% value. The 0th and 4th quartiles give us the min and max values respectively (a code sketch of this computation follows at the end of this section). In our case, these values effectively give us an estimate of the numbers below which half and 75% of the data set lie. Furthermore, the average values also give a good indication of the general value around which the range deviates.

Doing a training run on a set of 370 legitimate Android apps gives (excerpt of data set below):

[Benign training run excerpt]

A few things immediately stand out:
- The permissions count summary for malware shows an average of around 7, whereas the same count across about 400 games applications is around 2. A count of zero could indicate either malware or benign and thus has to be ignored in the final decision, while other parameters are taken into account.
- While the malware Bayes scores in the malware training set and the benign training set are comparable in range, the benign Bayes score in the benign training set has a much larger range than in the malware training set. This range check could be used in the final threat score.
- File sizes are not the best judge of set affinity, though the average sizes of malware samples seem much smaller than those of legitimate apps (though currently the benign set only has the most downloaded games).
- The 3rd quartile malware score of the malware set is significantly higher than that of the benign set, and the 3rd quartile benign score of the benign set is significantly higher than that of the malware set.
- Finally, the prior probabilities can be tuned to values other than benign set prior = 0.8 and malware set prior = 0.2, which effectively give a sample a 20-out-of-100 chance of being malware. This is, again, a statistical inference taken as a best guess; given that the number of Android malware samples is growing by the day, this value will have to be updated frequently.

Individually, none of these values on their own would indicate anything out of the ordinary, but in tandem they should give us a better view of the application profile. An important point to note is that the probability scores by themselves are quite skewed, in that the malicious Bayes scores tend to come out higher than the benign Bayes scores. Hence, a simple value comparison might not work at this stage; in fact, it would flag most apps as malware, which might be playing it too safe. Our focus is therefore on getting a better detection rate for legitimate apps. Once our classifier can make that decision well enough, we are done with the current process.
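Returning to the summary statistics, here is the promised sketch of the quartile computation, using the median-of-halves method described above on a single vector column (an illustration, not the tool's code):

[csharp]
using System;
using System.Collections.Generic;
using System.Linq;

static class ColumnSummary
{
    // Median of an already-sorted list.
    static double Median(IList<double> sorted)
    {
        int n = sorted.Count;
        return n % 2 == 1 ? sorted[n / 2]
                          : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    // Prints min, quartiles, max, and mean for one column of the vector set.
    public static void Print(IEnumerable<double> column)
    {
        var sorted = column.OrderBy(v => v).ToList();
        int half = sorted.Count / 2;   // lower/upper halves exclude the median for odd counts

        Console.WriteLine($"min     : {sorted.First()}");
        Console.WriteLine($"Q1 (25%): {Median(sorted.Take(half).ToList())}");
        Console.WriteLine($"Q2 (50%): {Median(sorted)}");
        Console.WriteLine($"Q3 (75%): {Median(sorted.Skip(sorted.Count - half).ToList())}");
        Console.WriteLine($"max     : {sorted.Last()}");
        Console.WriteLine($"mean    : {sorted.Average()}");
    }
}
[/csharp]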
Threat Rating

Threat rating can be done using a weighted, flag-based score system. For every such flag being hit, a certain cumulative score is assigned. This can be rule-based for now, built on the data gathered above; later on, intelligent pattern matching can be used to further embellish this detection rule set.

Rules draft –
If the permission count is above 4, hit flag. [MALWARE]
If the permission count is less than 4, hit flag. [BENIGN]
If the benign score >= 0.000313. [BENIGN]
If the benign score < 0.000313 and the benign score is not less than 2.12E-14. [BENIGN]
If the benign score >= 3.33943E-20 and the malware score <= 0.000219157. [MALWARE]
If the malware score <= 7.94E-04 and the benign score <= 0.000300. [BENIGN]
If the benign score <= 4.04471E-06 and the malware score <= 0.000219157. [MALWARE]
If the size >= 4978015. [BENIGN]
Recoding the setScore() function in the bayesClassifier.cs class –
[csharp]
public void setScore(List
[/csharp]
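As a rough illustration, a rule-based setScore() along the lines of the draft above might look like the following. The SampleVector type, the field names, and the verdict handling are assumptions made for this sketch, not the actual bayesClassifier.cs implementation:

[csharp]
using System.Collections.Generic;

// Assumed container for one sample's parameters; not the tool's actual type.
class SampleVector
{
    public int PermissionCount;
    public long FileSize;
    public double MalwareScore;   // malware Bayes score
    public double BenignScore;    // benign Bayes score
    public string Verdict;
}

// Sketch of a method belonging to the bayesClassifier class.
public void setScore(List<SampleVector> vectors)
{
    foreach (var v in vectors)
    {
        int malwareFlags = 0, benignFlags = 0;

        // Flags taken from the rules draft above.
        if (v.PermissionCount > 4) malwareFlags++;
        if (v.PermissionCount < 4) benignFlags++;
        if (v.BenignScore >= 0.000313) benignFlags++;
        if (v.BenignScore < 0.000313 && v.BenignScore >= 2.12E-14) benignFlags++;
        if (v.BenignScore >= 3.33943E-20 && v.MalwareScore <= 0.000219157) malwareFlags++;
        if (v.MalwareScore <= 7.94E-04 && v.BenignScore <= 0.000300) benignFlags++;
        if (v.BenignScore <= 4.04471E-06 && v.MalwareScore <= 0.000219157) malwareFlags++;
        if (v.FileSize >= 4978015) benignFlags++;

        // Cumulative verdict: clear margins decide, narrow ones stay suspicious.
        if (malwareFlags > benignFlags) v.Verdict = "MALWARE";
        else if (benignFlags > malwareFlags) v.Verdict = "BENIGN";
        else v.Verdict = "SUSPICIOUS";
    }
}
[/csharp]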
On the malware test set:

[Malware test set results]
On the benign test set:

[Benign test set results]
In this case, the detection rates have to be read in context. For the malware training and test sets above, the rate we are interested in is the first line, the malware percent detection rate, noting that this includes both confirmed malware and suspicious files. In both cases, the detection rates are above 60%, which is fair but not exactly production-ready; we can surely tune the data set to increase that value. Keep in mind that quite a number of files were flagged as suspicious, so additional fail-safes can be triggered to keep that well in check. However, our other high-priority task is to greatly minimize the false positive rate. Looking at the benign training set, the classifier correctly identified more than 80% of genuine Android apps. This is also fair, in the sense that 14 percent were flagged as malware or suspicious; again, taking into account the number of suspicious files, for which deeper checking can be done, the percentage can be greatly improved. Re-running the benign test set with suspicious-app checking triggered separately yields:

[Benign re-run results]
At this stage, the Bayes classifier is working, and a second phase of tuning followed by a fail-safe set of checking algorithms will surely bolster the detection rates while keeping false positives to a minimum. I would encourage you to take steps in that direction and share your findings as well.

Conclusion

You have gained an understanding of fundamental statistics and probability and applied the process practically to classify Android apps as benign or malware. Much of this, and a little more, is being implemented in my custom tool SCIENT – an Android Malware Analyser, itself a work in progress. Building an AV engine is always a challenge, and given the odds, a signature-based approach is no longer as effective as it was in the old days. Heuristics- and profiling-based engines that aim to mimic human cognitive processes, or that rely on statistical mechanics, have a better chance at survival if enough effort is taken to keep the algorithms tuned and running. Mobile platforms need something resilient and compact, and relying on signatures seems to be a losing battle, especially given the current crop and quantity of mobile malware. Similar techniques can be implemented for other platforms as well. Keeping in mind that larger sample sets have a better chance of yielding good modelling data, it's best to be patient and let the data sets do the talking, especially with machine-learning-based approaches. We will take a similar approach and overview towards spam mail classification in the next article.