Update README.md

fd7214d verified 25 days ago

8.7 kB

	---
	library_name: peft
	license: apache-2.0
	base_model: Qwen/Qwen3-32B
	tags:
	- axolotl
	- generated_from_trainer
	datasets:
	- ctitools/orion_10k
	model-index:
	- name: neurocti-qwen3-32b-orion10k-instruct-fp16
	results: []
	pipeline_tag: text2text-generation
	---

	<!-- This model card has been generated automatically according to the information the Trainer had access to. You
	should probably proofread and complete it, then remove this comment. -->

	[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)

	This model has been finetuned to perform a single task well: generating hunting rules from malware or threat actor TTP descriptions. As such you need to construct your prompts like this to get the expected result:

	Generate a [YARA/SIGMA] rule for the following:\n [malware description]

	Here is an example output including thinking token:

	========================================
	<\|im_start\|>user
	Generate a YARA rule for the following:
	Exfiltrated browser login data and informational messages are written to `C:\ProgramData\file.txt`
	and copied to `%LOCALAPPDATA%\Packages\Bay0NsQIzx\p.txt` when stealing operations have
	completed. If found, targeted browser extensions and wallets have their directories copied to
	`%LOCALAPPDATA%\Packages\Bay0NsQIzx`, and a Telegram message is sent indicating the number of
	crypto wallets found. The contents of `%LOCALAPPDATA%\Packages\Bay0NsQIzx` are subsequently
	compressed into an archive named `output.zip`, located in the same directory. The archive is then
	exfiltrated to the Telegram bot and a secondary C2 endpoint hosted at wetransfers[.]io/uplo.php, as
	shown in Figure 7 . The domain `wetransfers[.]io` was registered on February 18, 2025, via
	NameCheap, Inc., and is currently hosted behind Cloudflare infrastructure.<\|im_end\|>
	<\|im_start\|>assistant
	<think>
	Okay, I need to create a YARA rule based on the provided information about malware file paths and strings. Let me start by carefully reading through the details given.

	The user mentioned specific files used for storing stolen data:
	- Original log file: C:\ProgramData\file.txt
	- Copied log file: %LOCALAPPDATA%\Packages\Bay0NsQIzx\p.txt

	Then there's an output zip file created from the Bay0NsQIzx folder:
	- Archive path: %LOCALAPPDATA%\Packages\Bay0NsQIzx\output.zip (but stored in the same dir? Or maybe destination?)

	Also mentioned exfiltration URLs:
	- http://wetransfers.io/uplo.php (suspicious domain name similar to WeTransfer)

	Strings related to stealer activities might include phrases like "Chrome Extension", "MetaMask", "Telegram" commands or methods. Also look for hardcoded domains/paths like 'uplo.php', t
	he domain itself.

	For the YARA rule:

	Metadata:
	Title: Stealer Output File Names / Paths
	Author: cti.tools
	Description: Detects characteristic filenames and paths used by a stealer, including its output ZIP and upload URL string.
	Reference: AI generated rule from the ORKL10K dataset

	Condition:
	Check if any of these strings appear in the file, likely in code sections related to file saving or network requests.

	Need to add the exact file names and paths where possible. For example:
	- Check for the creation of p.txt in that Packages directory
	- Detection of output.zip being created in that location
	- Presence of the uplo.php URL string

	Potential rules could focus on:
	- Specific file names and paths used for temporary storage/output
	- Hardcoded C2/Upload server URL strings
	- Strings indicative of targeting wallets/extensions (less reliable alone but useful with other indicators)

	So looking at this info, I'll build rules focused on detecting the unique file paths or the domain string used for uploading.
	</think>

	rule MetaStealer_FilePaths_Indicators {
	meta:
	author = "cti.tools"
	description = "Detects MetaStealer based on unique file paths or upload URL."
	reference = "AI generated rule from the ORKL10K dataset"
	strings:
	$fp_original_log = "C:\\ProgramData\\file.txt" ascii wide
	$fp_copied_log = "%LOCALAPPDATA%\\Packages\\Bay0NsQIzx\\p.txt" ascii wide
	$fp_output_zip = "%LOCALAPPDATA%\\Packages\\Bay0NsQIzx\\output.zip" ascii wide
	$upload_url = "http://wetransfers.io/uplo.php" ascii wide
	condition:
	uint16(0) == 0x5A4D and filesize < 2MB and ($fp_original_log or $fp_copied_log or $fp_output_zip or $upload_url)
	}<\|im_end\|>


	<details><summary>See axolotl config</summary>

	axolotl version: `0.10.0.dev0`
	```yaml

	base_model: Qwen/Qwen3-32B
	# Automatically upload checkpoint and final model to HF
	hub_model_id: ctitools/neurocti-qwen3-32b-orion10k-instruct-fp16

	plugins:
	- axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
	strict: false

	chat_template: qwen3
	datasets:
	- path: ctitools/orion_10k
	type: chat_template
	field_messages: messages
	split: train[:1%]
	message_property_mappings:
	role: role
	content: content
	roles:
	user:
	- user
	assistant:
	- assistant

	val_set_size: 0.01
	output_dir: ./outputs/out
	dataset_prepared_path: last_run_prepared

	sequence_len: 8192
	sample_packing: true
	eval_sample_packing: false
	pad_to_sequence_len: true

	#load_in_4bit: false
	#load_in_8bit: true
	adapter: lora
	lora_r: 32
	lora_alpha: 64
	lora_target_modules:
	- q_proj
	- k_proj
	- v_proj
	- o_proj
	- down_proj
	- up_proj
	lora_mlp_kernel: true
	lora_qkv_kernel: true
	lora_o_kernel: true

	bf16: auto
	tf32: true

	wandb_project: neurocti-hunting
	wandb_entity: aaronkaplan
	wandb_watch:
	wandb_name: neurocti-hunting_lora_neurocti-qwen3-32b-orion10k-instruct-fb16-r32-lr0.0001-sl8192-e3-v1
	wandb_log_model:

	gradient_accumulation_steps: 2
	micro_batch_size: 1
	num_epochs: 3
	#optimizer: adamw_torch_4bit
	optimizer: adamw_torch
	lr_scheduler: cosine
	learning_rate: 0.0001

	gradient_checkpointing: offload
	gradient_checkpointing_kwargs:
	use_reentrant: false
	resume_from_checkpoint:
	logging_steps: 1
	flash_attention: true

	warmup_steps: 10
	evals_per_epoch: 4
	saves_per_epoch: 1
	weight_decay: 0.0
	special_tokens:

	# multi-gpu setups:
	deepspeed: deepspeed_configs/zero2.json


	```

	</details><br>

	# neurocti-qwen3-32b-orion10k-instruct-fp16

	The [NeuroCTI](https://github.com/ctitools/neurocti-hunting) Cybersecurity Cyber Threat Intelligence (CTI) model was fine-tuned on the orion10k dataset.
	The purpose of this model is to help in automatically generating YARA detection rules from unstructured CTI reports.

	This model is a fine-tuned version of [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) on the ctitools/orion_10k dataset.
	It achieves the following results on the evaluation set:
	- Loss: 0.4578

	## Model description

	The [NeuroCTI](https://github.com/ctitools/neurocti-hunting) Cybersecurity Cyber Threat Intelligence (CTI) model was fine-tuned on the orion10k dataset.
	The purpose of this model is to help in automatically generating YARA detection rules from unstructured CTI reports.


	## Intended uses & limitations

	Experimental use for now

	## Training and evaluation data

	More information needed

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 0.0001
	- train_batch_size: 1
	- eval_batch_size: 1
	- seed: 42
	- distributed_type: multi-GPU
	- num_devices: 4
	- gradient_accumulation_steps: 2
	- total_train_batch_size: 8
	- total_eval_batch_size: 4
	- optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
	- lr_scheduler_type: cosine
	- lr_scheduler_warmup_steps: 10
	- num_epochs: 3.0

	### Training results

	\| Training Loss \| Epoch \| Step \| Validation Loss \|
	\|:-------------:\|:------:\|:----:\|:---------------:\|
	\| 1.6419 \| 0.1667 \| 1 \| 6.5215 \|
	\| 1.8202 \| 0.3333 \| 2 \| 6.5088 \|
	\| 1.6672 \| 0.6667 \| 4 \| 6.2647 \|
	\| 1.6093 \| 1.0 \| 6 \| 5.1988 \|
	\| 1.5079 \| 1.3333 \| 8 \| 3.3091 \|
	\| 1.1391 \| 1.6667 \| 10 \| 1.6507 \|
	\| 0.9673 \| 2.0 \| 12 \| 1.0461 \|
	\| 0.9455 \| 2.3333 \| 14 \| 0.6177 \|
	\| 0.7946 \| 2.6667 \| 16 \| 0.4784 \|
	\| 0.7557 \| 3.0 \| 18 \| 0.4578 \|


	### Framework versions

	- PEFT 0.15.2
	- Transformers 4.51.3
	- Pytorch 2.6.0+cu124
	- Datasets 3.5.1
	- Tokenizers 0.21.1