张一极
date:20220825-22:42
Keywords: stratified sampling
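This note implements a stratified sampling algorithm for splitting a dataset. Starting from the first sample, set up len(classes) baskets, one per class. For each sample, tentatively place its objects into the matching baskets, compute the class distribution that results, and then move on to the next sample.
Once a class reaches its target proportion, stop placing objects into that basket. The loop then continues with the next sample and keeps filling the classes that are still below their target.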
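As a rough illustration of the basket idea, the sketch below runs the same kind of greedy quota filling on in-memory data instead of label files. The function name `greedy_split` and the input format (one list of class ids per sample) are assumptions made for this example; they are not part of the original code.

```python
from collections import Counter


def greedy_split(samples, train_percent=0.7):
    """Greedy per-class quota filling (hypothetical helper, for illustration only).

    samples: list of lists; each inner list holds the class ids of one sample.
    Returns (train_indices, val_indices, train_counts).
    """
    # Total number of objects per class over the whole dataset.
    totals = Counter(cls for sample in samples for cls in sample)
    # One "basket" per class: how many objects the train split may take.
    quota = {cls: n * train_percent for cls, n in totals.items()}

    train_counts = Counter()
    train_idx, val_idx = [], []
    for i, sample in enumerate(samples):
        placed = False
        for cls in sample:
            # Place the object only while its basket is below the quota.
            if train_counts[cls] < quota[cls]:
                train_counts[cls] += 1
                placed = True
        # A sample goes to train if at least one of its objects was placed.
        (train_idx if placed else val_idx).append(i)
    return train_idx, val_idx, train_counts


# Toy dataset: four samples of class "0", six samples of class "1".
samples = [["0"]] * 4 + [["1"]] * 6
train_idx, val_idx, train_counts = greedy_split(samples, train_percent=0.5)
print(train_idx)           # [0, 1, 4, 5, 6]
print(val_idx)             # [2, 3, 7, 8, 9]
print(dict(train_counts))  # {'0': 2, '1': 3}
```

With `train_percent=0.5`, each class contributes half of its objects to the train split, so both splits keep the original 4:6 class ratio.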
import os


def auto_split(input_path="/backup/datasets/new_HD/new_HD/labels/",
               output_path="/backup/datasets/new_HD/new_HD/",
               train_percent=0.7, val_percent=0.3):
    # get_train_obj_dict, get_val_obj_dict, get_the_yolo_labels and write_txt
    # are helper functions assumed to be defined elsewhere; they are not shown here.
    labels_list = os.listdir(input_path)

    # Pass 1: count the objects of each class over the whole dataset.
    dict_classes = {}
    for label in labels_list:
        with open(os.path.join(input_path, label), encoding="utf-8") as f:
            for line in f.read().split("\n"):
                obj_name = line.split(" ")[0]
                if obj_name:  # skip empty lines
                    dict_classes[obj_name] = dict_classes.get(obj_name, 0) + 1

    # Per-class object quotas for each split.
    train_obj_dict_expect = get_train_obj_dict(dict_classes, train_percent)
    # Computed for reference only; the greedy pass below does not use it.
    val_obj_dict_expect = get_val_obj_dict(dict_classes, val_percent)
    train_obj_dict_now = {}
    val_obj_dict_now = {}
    count_train_sample = 0
    count_val_sample = 0

    # Pass 2: greedily fill the per-class train baskets, sample by sample.
    for sample in labels_list:
        flag_train_added = 0
        sample_info = get_the_yolo_labels(os.path.join(input_path, sample))
        for obj_ in sample_info:
            class_name = obj_.split(" ")[0]
            if class_name == "":
                continue
            current = train_obj_dict_now.get(class_name, 0)
            # Count the object only while its class is below the train quota.
            if current < train_obj_dict_expect[class_name]:
                train_obj_dict_now[class_name] = current + 1
                flag_train_added = 1
        if flag_train_added:
            write_txt(os.path.join(output_path, "train_list.txt"), sample)
            count_train_sample += 1
        else:
            # No object could be placed (all quotas met, or an empty label file),
            # so the sample goes to the validation split.
            write_txt(os.path.join(output_path, "val_list.txt"), sample)
            count_val_sample += 1
            for obj_ in sample_info:
                class_name = obj_.split(" ")[0]
                if class_name != "":
                    val_obj_dict_now[class_name] = val_obj_dict_now.get(class_name, 0) + 1

    print("trainset objs distribution : ", train_obj_dict_now)
    print("valset objs distribution : ", val_obj_dict_now)

In the end, this yields a dataset split with a fairly balanced class distribution.
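`auto_split` calls four helpers that this note does not show: `get_train_obj_dict`, `get_val_obj_dict`, `get_the_yolo_labels` and `write_txt`. The sketch below is one possible way they might look. Every function body here is an assumption inferred from how the helpers are called, not the original implementation.

```python
import os
import tempfile


def get_train_obj_dict(dict_classes, train_percent):
    # Assumed behaviour: per-class object quota for the train split.
    return {name: count * train_percent for name, count in dict_classes.items()}


def get_val_obj_dict(dict_classes, val_percent):
    # Assumed behaviour: per-class object quota for the validation split.
    return {name: count * val_percent for name, count in dict_classes.items()}


def get_the_yolo_labels(label_path):
    # Assumed behaviour: return the raw lines of a YOLO label file,
    # where each line reads "class x_center y_center width height".
    with open(label_path, encoding="utf-8") as f:
        return f.read().split("\n")


def write_txt(txt_path, line):
    # Assumed behaviour: append one sample name per line to a list file.
    with open(txt_path, "a", encoding="utf-8") as f:
        f.write(line + "\n")


# Small self-check in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    label_file = os.path.join(tmp, "a.txt")
    with open(label_file, "w", encoding="utf-8") as f:
        f.write("0 0.5 0.5 0.1 0.1\n1 0.2 0.2 0.1 0.1\n")
    lines = get_the_yolo_labels(label_file)

    list_file = os.path.join(tmp, "train_list.txt")
    write_txt(list_file, "a.txt")
    write_txt(list_file, "b.txt")
    with open(list_file, encoding="utf-8") as f:
        written = f.read()

quotas = get_train_obj_dict({"0": 10, "1": 4}, 0.5)
print(lines)
print(quotas)
```

Because `write_txt` appends, the list files from an earlier run should be deleted before calling `auto_split` again; otherwise the new sample names are added to the old ones.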