Hyperparameter tuning with Ray Tune
Translated by: 심형준
Hyperparameter tuning can make the difference between an average model and a highly accurate one. Often, simple things like choosing a different learning rate or changing a network layer size can have a dramatic impact on your model's performance.
Fortunately, there are tools that help find the best combination of parameters. Ray Tune is an industry-standard tool for distributed hyperparameter tuning. Ray Tune includes the latest hyperparameter search algorithms, integrates with TensorBoard and other analysis libraries, and natively supports distributed training through Ray's distributed machine learning engine.
In this tutorial, we will show you how to integrate Ray Tune into your PyTorch training workflow. We will extend the tutorial from the PyTorch documentation for training a CIFAR10 image classifier.
As you will see, we only need to make a few small modifications:
wrap the data loading and training in functions,
make some network parameters configurable,
add checkpointing (optional),
and define the search space for the model tuning.
To run this tutorial, please make sure the following packages are installed:
ray[tune]: the distributed hyperparameter tuning library
torchvision: for the data transforms
Setup / Imports
Let's start with the imports:
from functools import partial
import numpy as np
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import random_split
import torchvision
import torchvision.transforms as transforms
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler
Most of these imports are needed for building the PyTorch model. Only the last three are for Ray Tune.
Data loaders
We wrap the data loaders in their own function and pass a global data directory. This way we can share a data directory between different trials.
def load_data(data_dir="./data"):
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    trainset = torchvision.datasets.CIFAR10(
        root=data_dir, train=True, download=True, transform=transform)

    testset = torchvision.datasets.CIFAR10(
        root=data_dir, train=False, download=True, transform=transform)

    return trainset, testset
Configurable neural network
We can only tune parameters that are configurable. In this example, we make the sizes of the fully connected layers configurable:
class Net(nn.Module):
    def __init__(self, l1=120, l2=84):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, l1)
        self.fc2 = nn.Linear(l1, l2)
        self.fc3 = nn.Linear(l2, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
The train function
Now it gets interesting: we introduce some changes to the example from the PyTorch documentation.
We wrap the training script in a function train_cifar(config, checkpoint_dir=None, data_dir=None). As you can guess, the config parameter receives the hyperparameters we would like to train with. The checkpoint_dir parameter is used to restore checkpoints. The data_dir parameter specifies the directory where we load and store the data, so multiple runs can share the same data source.
net = Net(config["l1"], config["l2"])

if checkpoint_dir:
    model_state, optimizer_state = torch.load(
        os.path.join(checkpoint_dir, "checkpoint"))
    net.load_state_dict(model_state)
    optimizer.load_state_dict(optimizer_state)
The learning rate of the optimizer is made configurable, too:
optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)
We also split the training data into a training and a validation subset: we train on 80% of the data and compute the validation loss on the remaining 20%. The batch sizes with which we iterate through the training and test sets are configurable as well.
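A brief sketch of that split (inside train_cifar, using its config and data_dir arguments; the same code appears in the full training function below):

trainset, testset = load_data(data_dir)

# Use 80% of the training data for training, the remaining 20% for validation
test_abs = int(len(trainset) * 0.8)
train_subset, val_subset = random_split(
    trainset, [test_abs, len(trainset) - test_abs])

trainloader = torch.utils.data.DataLoader(
    train_subset, batch_size=int(config["batch_size"]),
    shuffle=True, num_workers=8)
valloader = torch.utils.data.DataLoader(
    val_subset, batch_size=int(config["batch_size"]),
    shuffle=True, num_workers=8)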
Adding (multi-)GPU support with DataParallel
Image classification benefits largely from GPUs. Luckily, we can continue to use PyTorch's abstractions in Ray Tune. Thus, we can wrap our model in nn.DataParallel to support data-parallel training on multiple GPUs:
device = "cpu"
if torch.cuda.is_available():
device = "cuda:0"
if torch.cuda.device_count() > 1:
net = nn.DataParallel(net)
net.to(device)
By using a device variable, we make sure that training also works when no GPUs are available.
PyTorch requires us to send our data to the GPU memory explicitly, like this:
for i, data in enumerate(trainloader, 0):
    inputs, labels = data
    inputs, labels = inputs.to(device), labels.to(device)
The code now supports training on CPUs, on a single GPU, and on multiple GPUs. Notably, Ray also supports fractional GPUs, so we can share a GPU among trials as long as the model still fits in GPU memory. We will come back to that later.
Communicating with Ray Tune
The most interesting part is the communication with Ray Tune:
with tune.checkpoint_dir(epoch) as checkpoint_dir:
    path = os.path.join(checkpoint_dir, "checkpoint")
    torch.save((net.state_dict(), optimizer.state_dict()), path)

tune.report(loss=(val_loss / val_steps), accuracy=correct / total)
Here we first save a checkpoint and then report some metrics back to Ray Tune. Specifically, we send the validation loss and accuracy back to Ray Tune. Ray Tune can then use these metrics to decide which hyperparameter configuration leads to the best results. These metrics can also be used to stop badly performing trials early, to avoid wasting resources on them.
Saving the checkpoint is optional, but it is required if we want to use advanced schedulers like Population Based Training. Saving checkpoints also lets us later load the trained models and validate them on a test set.
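For instance (a sketch: result is assumed to come from a finished tune.run call, and net from Net(...) with matching layer sizes; the same pattern appears in the main function below), loading a saved model for test set validation looks like this:

# Pick the best trial and restore the (model_state, optimizer_state) tuple
best_trial = result.get_best_trial("loss", "min", "last")
best_checkpoint_dir = best_trial.checkpoint.value
model_state, optimizer_state = torch.load(
    os.path.join(best_checkpoint_dir, "checkpoint"))
net.load_state_dict(model_state)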
Full training function
The full code example looks like this:
def train_cifar(config, checkpoint_dir=None, data_dir=None):
    net = Net(config["l1"], config["l2"])

    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if torch.cuda.device_count() > 1:
            net = nn.DataParallel(net)
    net.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)

    if checkpoint_dir:
        model_state, optimizer_state = torch.load(
            os.path.join(checkpoint_dir, "checkpoint"))
        net.load_state_dict(model_state)
        optimizer.load_state_dict(optimizer_state)

    trainset, testset = load_data(data_dir)

    test_abs = int(len(trainset) * 0.8)
    train_subset, val_subset = random_split(
        trainset, [test_abs, len(trainset) - test_abs])

    trainloader = torch.utils.data.DataLoader(
        train_subset,
        batch_size=int(config["batch_size"]),
        shuffle=True,
        num_workers=8)
    valloader = torch.utils.data.DataLoader(
        val_subset,
        batch_size=int(config["batch_size"]),
        shuffle=True,
        num_workers=8)

    for epoch in range(10):  # loop over the dataset multiple times
        running_loss = 0.0
        epoch_steps = 0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            epoch_steps += 1
            if i % 2000 == 1999:  # print every 2000 mini-batches
                print("[%d, %5d] loss: %.3f" % (epoch + 1, i + 1,
                                                running_loss / epoch_steps))
                running_loss = 0.0

        # Validation loss
        val_loss = 0.0
        val_steps = 0
        total = 0
        correct = 0
        for i, data in enumerate(valloader, 0):
            with torch.no_grad():
                inputs, labels = data
                inputs, labels = inputs.to(device), labels.to(device)

                outputs = net(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

                loss = criterion(outputs, labels)
                val_loss += loss.cpu().numpy()
                val_steps += 1

        with tune.checkpoint_dir(epoch) as checkpoint_dir:
            path = os.path.join(checkpoint_dir, "checkpoint")
            torch.save((net.state_dict(), optimizer.state_dict()), path)

        tune.report(loss=(val_loss / val_steps), accuracy=correct / total)
    print("Finished Training")
As you can see, most of the code is adapted directly from the original example.
Test set accuracy
Commonly, the performance of a machine learning model is tested on a held-out test set with data that has not been used for training the model. We also wrap this in a function:
def test_accuracy(net, device="cpu"):
    trainset, testset = load_data()

    testloader = torch.utils.data.DataLoader(
        testset, batch_size=4, shuffle=False, num_workers=2)

    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = net(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    return correct / total
The function also expects a device parameter, so we can do the test set validation on a GPU.
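For example, a hypothetical usage might look like this (net and device are illustrative names; the freshly constructed Net merely stands in for a trained model):

# Evaluate a network on the GPU if one is available
device = "cuda:0" if torch.cuda.is_available() else "cpu"
net = Net(l1=32, l2=16).to(device)
print("Test set accuracy:", test_accuracy(net, device))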
Configuring the search space
Lastly, we need to define Ray Tune's search space. Here is an example:
config = {
    "l1": tune.sample_from(lambda _: 2**np.random.randint(2, 9)),
    "l2": tune.sample_from(lambda _: 2**np.random.randint(2, 9)),
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([2, 4, 8, 16])
}
The tune.sample_from() function makes it possible to define your own sampling methods to obtain hyperparameters. In this example, the l1 and l2 parameters should be powers of 2 between 4 and 256, i.e. 4, 8, 16, 32, 64, 128, or 256. The lr (learning rate) should be sampled log-uniformly between 0.0001 and 0.1. Lastly, the batch size is a choice among 2, 4, 8, and 16.
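As an aside (a sketch, not part of the original example; config_alt is an illustrative name), the same power-of-two space for l1 and l2 could equivalently be written with tune.choice:

# Equivalent search space using an explicit list of powers of two
config_alt = {
    "l1": tune.choice([2 ** i for i in range(2, 9)]),  # 4, 8, ..., 256
    "l2": tune.choice([2 ** i for i in range(2, 9)]),
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([2, 4, 8, 16])
}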
At each trial, Ray Tune now randomly samples a combination of parameters from these search spaces. It then trains a number of models in parallel and finds the best-performing one among them. We also use the ASHAScheduler, which terminates badly performing trials early.
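The scheduler and the progress reporter passed to tune.run() below are constructed as follows (the same construction appears in the full main function later; max_num_epochs comes from that function's arguments):

scheduler = ASHAScheduler(
    metric="loss",
    mode="min",
    max_t=max_num_epochs,
    grace_period=1,
    reduction_factor=2)
reporter = CLIReporter(
    metric_columns=["loss", "accuracy", "training_iteration"])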
We wrap the train_cifar function with functools.partial to set the constant data_dir parameter. We can also tell Ray Tune which resources should be available for each trial:
gpus_per_trial = 2
# ...
result = tune.run(
    partial(train_cifar, data_dir=data_dir),
    resources_per_trial={"cpu": 8, "gpu": gpus_per_trial},
    config=config,
    num_samples=num_samples,
    scheduler=scheduler,
    progress_reporter=reporter,
    checkpoint_at_end=True)
You can specify the number of CPUs, which are then available, e.g., to increase the num_workers of the PyTorch DataLoader instances.
The selected number of GPUs is made visible to PyTorch in each trial. Trials do not have access to GPUs that have not been requested for them, so you don't have to worry about two trials using the same set of resources.
We can also specify fractional GPUs, so something like gpus_per_trial=0.5 is completely valid. The trials will then share a GPU among each other; you just have to make sure that the models still fit in the GPU memory.
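A sketch of such a fractional-GPU request, reusing the tune.run() call from above:

# Two trials now share one GPU, as long as both models fit in its memory
result = tune.run(
    partial(train_cifar, data_dir=data_dir),
    resources_per_trial={"cpu": 2, "gpu": 0.5},
    config=config,
    num_samples=num_samples,
    scheduler=scheduler,
    progress_reporter=reporter)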
After training the models, we will find the best-performing one and load the trained network from its checkpoint file. We then obtain the test set accuracy and report everything by printing.
The full main function looks like this:
def main(num_samples=10, max_num_epochs=10, gpus_per_trial=2):
    data_dir = os.path.abspath("./data")
    load_data(data_dir)
    config = {
        "l1": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        "l2": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([2, 4, 8, 16])
    }
    scheduler = ASHAScheduler(
        metric="loss",
        mode="min",
        max_t=max_num_epochs,
        grace_period=1,
        reduction_factor=2)
    reporter = CLIReporter(
        # parameter_columns=["l1", "l2", "lr", "batch_size"],
        metric_columns=["loss", "accuracy", "training_iteration"])
    result = tune.run(
        partial(train_cifar, data_dir=data_dir),
        resources_per_trial={"cpu": 2, "gpu": gpus_per_trial},
        config=config,
        num_samples=num_samples,
        scheduler=scheduler,
        progress_reporter=reporter)

    best_trial = result.get_best_trial("loss", "min", "last")
    print("Best trial config: {}".format(best_trial.config))
    print("Best trial final validation loss: {}".format(
        best_trial.last_result["loss"]))
    print("Best trial final validation accuracy: {}".format(
        best_trial.last_result["accuracy"]))

    best_trained_model = Net(best_trial.config["l1"], best_trial.config["l2"])
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if gpus_per_trial > 1:
            best_trained_model = nn.DataParallel(best_trained_model)
    best_trained_model.to(device)

    best_checkpoint_dir = best_trial.checkpoint.value
    model_state, optimizer_state = torch.load(os.path.join(
        best_checkpoint_dir, "checkpoint"))
    best_trained_model.load_state_dict(model_state)

    test_acc = test_accuracy(best_trained_model, device)
    print("Best trial test set accuracy: {}".format(test_acc))


if __name__ == "__main__":
    # You can change the number of GPUs per trial here:
    main(num_samples=10, max_num_epochs=10, gpus_per_trial=0)
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to /workspace/ko-latest/beginner_source/data/cifar-10-python.tar.gz
Extracting /workspace/ko-latest/beginner_source/data/cifar-10-python.tar.gz to /workspace/ko-latest/beginner_source/data
Files already downloaded and verified
2023-05-06 11:42:05,821 WARNING services.py:2002 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 4294963200 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-05-06 11:42:08,345 WARNING tune.py:668 -- Tune detects GPUs, but no trials are using GPUs. To enable trials to use GPUs, set tune.run(resources_per_trial={'gpu': 1}...) which allows Tune to expose 1 GPU to each trial. You can also override `Trainable.default_resource_request` if using the Trainable API.
2023-05-06 11:42:08,694 ERROR syncer.py:147 -- Log sync requires rsync to be installed.
== Status ==
Current time: 2023-05-06 11:42:08 (running for 00:00:00.36)
Memory usage on this node: 12.7/62.7 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 2.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (9 PENDING, 1 RUNNING)
+-------------------------+----------+------------------+--------------+------+------+-------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr |
|-------------------------+----------+------------------+--------------+------+------+-------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 |
| train_cifar_9812d_00001 | PENDING | | 16 | 256 | 4 | 0.000744654 |
| train_cifar_9812d_00002 | PENDING | | 16 | 32 | 16 | 0.000578039 |
| train_cifar_9812d_00003 | PENDING | | 16 | 64 | 256 | 0.00487814 |
| train_cifar_9812d_00004 | PENDING | | 8 | 8 | 256 | 0.00547006 |
| train_cifar_9812d_00005 | PENDING | | 16 | 256 | 32 | 0.00313045 |
| train_cifar_9812d_00006 | PENDING | | 2 | 16 | 32 | 0.0225081 |
| train_cifar_9812d_00007 | PENDING | | 2 | 256 | 8 | 0.00253129 |
| train_cifar_9812d_00008 | PENDING | | 2 | 64 | 128 | 0.00101019 |
| train_cifar_9812d_00009 | PENDING | | 8 | 64 | 4 | 0.000166221 |
+-------------------------+----------+------------------+--------------+------+------+-------------+
(func pid=16375) Files already downloaded and verified
(func pid=16375) Files already downloaded and verified
(func pid=16431) Files already downloaded and verified
(func pid=16439) Files already downloaded and verified
(func pid=16429) Files already downloaded and verified
(func pid=16426) Files already downloaded and verified
(func pid=16425) Files already downloaded and verified
(func pid=16433) Files already downloaded and verified
(func pid=16423) Files already downloaded and verified
(func pid=16437) Files already downloaded and verified
(func pid=16435) Files already downloaded and verified
== Status ==
Current time: 2023-05-06 11:42:17 (running for 00:00:08.92)
Memory usage on this node: 15.7/62.7 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 20.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (10 RUNNING)
+-------------------------+----------+------------------+--------------+------+------+-------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr |
|-------------------------+----------+------------------+--------------+------+------+-------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 |
| train_cifar_9812d_00001 | RUNNING | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 |
| train_cifar_9812d_00002 | RUNNING | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 |
| train_cifar_9812d_00004 | RUNNING | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 |
| train_cifar_9812d_00005 | RUNNING | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 |
| train_cifar_9812d_00006 | RUNNING | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 |
| train_cifar_9812d_00007 | RUNNING | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 |
| train_cifar_9812d_00009 | RUNNING | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 |
+-------------------------+----------+------------------+--------------+------+------+-------------+
(func pid=16426) Files already downloaded and verified
(func pid=16433) Files already downloaded and verified
(func pid=16423) Files already downloaded and verified
(func pid=16435) Files already downloaded and verified
(func pid=16431) Files already downloaded and verified
(func pid=16439) Files already downloaded and verified
(func pid=16429) Files already downloaded and verified
(func pid=16425) Files already downloaded and verified
(func pid=16437) Files already downloaded and verified
== Status ==
Current time: 2023-05-06 11:42:22 (running for 00:00:13.95)
Memory usage on this node: 18.7/62.7 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 20.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (10 RUNNING)
+-------------------------+----------+------------------+--------------+------+------+-------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr |
|-------------------------+----------+------------------+--------------+------+------+-------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 |
| train_cifar_9812d_00001 | RUNNING | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 |
| train_cifar_9812d_00002 | RUNNING | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 |
| train_cifar_9812d_00004 | RUNNING | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 |
| train_cifar_9812d_00005 | RUNNING | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 |
| train_cifar_9812d_00006 | RUNNING | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 |
| train_cifar_9812d_00007 | RUNNING | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 |
| train_cifar_9812d_00009 | RUNNING | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 |
+-------------------------+----------+------------------+--------------+------+------+-------------+
(func pid=16433) [1, 2000] loss: 2.317
== Status ==
Current time: 2023-05-06 11:42:27 (running for 00:00:18.96)
Memory usage on this node: 18.8/62.7 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 20.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (10 RUNNING)
+-------------------------+----------+------------------+--------------+------+------+-------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr |
|-------------------------+----------+------------------+--------------+------+------+-------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 |
| train_cifar_9812d_00001 | RUNNING | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 |
| train_cifar_9812d_00002 | RUNNING | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 |
| train_cifar_9812d_00004 | RUNNING | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 |
| train_cifar_9812d_00005 | RUNNING | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 |
| train_cifar_9812d_00006 | RUNNING | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 |
| train_cifar_9812d_00007 | RUNNING | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 |
| train_cifar_9812d_00009 | RUNNING | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 |
+-------------------------+----------+------------------+--------------+------+------+-------------+
(func pid=16375) [1, 2000] loss: 1.888
(func pid=16435) [1, 2000] loss: 2.264
(func pid=16437) [1, 2000] loss: 2.250
(func pid=16439) [1, 2000] loss: 2.339
(func pid=16429) [1, 2000] loss: 1.915
== Status ==
Current time: 2023-05-06 11:42:32 (running for 00:00:23.98)
Memory usage on this node: 18.8/62.7 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 20.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (10 RUNNING)
+-------------------------+----------+------------------+--------------+------+------+-------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr |
|-------------------------+----------+------------------+--------------+------+------+-------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 |
| train_cifar_9812d_00001 | RUNNING | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 |
| train_cifar_9812d_00002 | RUNNING | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 |
| train_cifar_9812d_00004 | RUNNING | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 |
| train_cifar_9812d_00005 | RUNNING | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 |
| train_cifar_9812d_00006 | RUNNING | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 |
| train_cifar_9812d_00007 | RUNNING | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 |
| train_cifar_9812d_00009 | RUNNING | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 |
+-------------------------+----------+------------------+--------------+------+------+-------------+
(func pid=16425) [1, 2000] loss: 2.275
(func pid=16426) [1, 2000] loss: 1.803
Result for train_cifar_9812d_00000:
accuracy: 0.4427
date: 2023-05-06_11-42-33
done: false
experiment_id: 4288901cf29e43b8ac8c6a7047b15259
hostname: 83df70b6be24
iterations_since_restore: 1
loss: 1.5169920167922974
node_ip: 172.17.0.2
pid: 16375
should_checkpoint: true
time_since_restore: 21.45641016960144
time_this_iter_s: 21.45641016960144
time_total_s: 21.45641016960144
timestamp: 1683340953
timesteps_since_restore: 0
training_iteration: 1
trial_id: 9812d_00000
warmup_time: 0.00445103645324707
(func pid=16431) [1, 2000] loss: 1.861
(func pid=16423) [1, 2000] loss: 2.286
(func pid=16433) [1, 4000] loss: 1.162
(func pid=16437) [1, 4000] loss: 0.979
(func pid=16435) [1, 4000] loss: 1.014
== Status ==
Current time: 2023-05-06 11:42:38 (running for 00:00:30.31)
Memory usage on this node: 18.7/62.7 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: -1.5169920167922974
Resources requested: 20.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (10 RUNNING)
+-------------------------+----------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+----------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.51699 | 0.4427 | 1 |
| train_cifar_9812d_00001 | RUNNING | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | | | |
| train_cifar_9812d_00002 | RUNNING | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | | | |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | | | |
| train_cifar_9812d_00004 | RUNNING | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | | | |
| train_cifar_9812d_00005 | RUNNING | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | | | |
| train_cifar_9812d_00006 | RUNNING | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | | | |
| train_cifar_9812d_00007 | RUNNING | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | | | |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | | | |
| train_cifar_9812d_00009 | RUNNING | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | | | |
+-------------------------+----------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
Result for train_cifar_9812d_00002:
accuracy: 0.2193
date: 2023-05-06_11-42-38
done: true
experiment_id: dced52322a924a54a681025c1f0176e5
hostname: 83df70b6be24
iterations_since_restore: 1
loss: 2.0658682716369627
node_ip: 172.17.0.2
pid: 16425
should_checkpoint: true
time_since_restore: 22.5540030002594
time_this_iter_s: 22.5540030002594
time_total_s: 22.5540030002594
timestamp: 1683340958
timesteps_since_restore: 0
training_iteration: 1
trial_id: 9812d_00002
warmup_time: 0.004146099090576172
Result for train_cifar_9812d_00003:
accuracy: 0.4817
date: 2023-05-06_11-42-39
done: false
experiment_id: 970215e7451c4886bcb97cffbd6240bf
hostname: 83df70b6be24
iterations_since_restore: 1
loss: 1.432033909034729
node_ip: 172.17.0.2
pid: 16426
should_checkpoint: true
time_since_restore: 22.88657021522522
time_this_iter_s: 22.88657021522522
time_total_s: 22.88657021522522
timestamp: 1683340959
timesteps_since_restore: 0
training_iteration: 1
trial_id: 9812d_00003
warmup_time: 0.004241466522216797
(func pid=16439) [1, 4000] loss: 1.160
Result for train_cifar_9812d_00001:
accuracy: 0.227
date: 2023-05-06_11-42-40
done: true
experiment_id: 2a7f2b215cfe46b0989f522a54331a8a
hostname: 83df70b6be24
iterations_since_restore: 1
loss: 2.1031290491104127
node_ip: 172.17.0.2
pid: 16423
should_checkpoint: true
time_since_restore: 24.080467700958252
time_this_iter_s: 24.080467700958252
time_total_s: 24.080467700958252
timestamp: 1683340960
timesteps_since_restore: 0
training_iteration: 1
trial_id: 9812d_00001
warmup_time: 0.004377603530883789
Result for train_cifar_9812d_00005:
accuracy: 0.4492
date: 2023-05-06_11-42-40
done: false
experiment_id: d41dafbd385442dcb50ad65850f13970
hostname: 83df70b6be24
iterations_since_restore: 1
loss: 1.4909102236747742
node_ip: 172.17.0.2
pid: 16431
should_checkpoint: true
time_since_restore: 24.44042420387268
time_this_iter_s: 24.44042420387268
time_total_s: 24.44042420387268
timestamp: 1683340960
timesteps_since_restore: 0
training_iteration: 1
trial_id: 9812d_00005
warmup_time: 0.0046710968017578125
(func pid=16429) [1, 4000] loss: 0.825
(func pid=16433) [1, 6000] loss: 0.776
(func pid=16437) [1, 6000] loss: 0.602
(func pid=16435) [1, 6000] loss: 0.640
== Status ==
Current time: 2023-05-06 11:42:45 (running for 00:00:37.21)
Memory usage on this node: 17.4/62.7 GiB
Using AsyncHyperBand: num_stopped=2
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: -1.5169920167922974
Resources requested: 16.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (8 RUNNING, 2 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.51699 | 0.4427 | 1 |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.43203 | 0.4817 | 1 |
| train_cifar_9812d_00004 | RUNNING | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | | | |
| train_cifar_9812d_00005 | RUNNING | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.49091 | 0.4492 | 1 |
| train_cifar_9812d_00006 | RUNNING | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | | | |
| train_cifar_9812d_00007 | RUNNING | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | | | |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | | | |
| train_cifar_9812d_00009 | RUNNING | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | | | |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
(func pid=16375) [2, 2000] loss: 1.440
Result for train_cifar_9812d_00009:
accuracy: 0.099
date: 2023-05-06_11-42-48
done: true
experiment_id: ea91200cbaf94d55a2136fa5fa4e5e26
hostname: 83df70b6be24
iterations_since_restore: 1
loss: 2.3111555027008057
node_ip: 172.17.0.2
pid: 16439
should_checkpoint: true
time_since_restore: 32.00885462760925
time_this_iter_s: 32.00885462760925
time_total_s: 32.00885462760925
timestamp: 1683340968
timesteps_since_restore: 0
training_iteration: 1
trial_id: 9812d_00009
warmup_time: 0.004668474197387695
Result for train_cifar_9812d_00004:
accuracy: 0.4286
date: 2023-05-06_11-42-49
done: false
experiment_id: 8eb27a117a5845e9afe5fdbd364fa30f
hostname: 83df70b6be24
iterations_since_restore: 1
loss: 1.564001158285141
node_ip: 172.17.0.2
pid: 16429
should_checkpoint: true
time_since_restore: 33.02037191390991
time_this_iter_s: 33.02037191390991
time_total_s: 33.02037191390991
timestamp: 1683340969
timesteps_since_restore: 0
training_iteration: 1
trial_id: 9812d_00004
warmup_time: 0.004842996597290039
(func pid=16433) [1, 8000] loss: 0.583
(func pid=16437) [1, 8000] loss: 0.420
Result for train_cifar_9812d_00000:
accuracy: 0.5157
date: 2023-05-06_11-42-52
done: false
experiment_id: 4288901cf29e43b8ac8c6a7047b15259
hostname: 83df70b6be24
iterations_since_restore: 2
loss: 1.3616330410957336
node_ip: 172.17.0.2
pid: 16375
should_checkpoint: true
time_since_restore: 39.94718647003174
time_this_iter_s: 18.490776300430298
time_total_s: 39.94718647003174
timestamp: 1683340972
timesteps_since_restore: 0
training_iteration: 2
trial_id: 9812d_00000
warmup_time: 0.00445103645324707
== Status ==
Current time: 2023-05-06 11:42:52 (running for 00:00:43.79)
Memory usage on this node: 16.7/62.7 GiB
Using AsyncHyperBand: num_stopped=3
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: -1.3616330410957336 | Iter 1.000: -1.564001158285141
Resources requested: 14.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (7 RUNNING, 3 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.36163 | 0.5157 | 2 |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.43203 | 0.4817 | 1 |
| train_cifar_9812d_00004 | RUNNING | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | 1.564 | 0.4286 | 1 |
| train_cifar_9812d_00005 | RUNNING | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.49091 | 0.4492 | 1 |
| train_cifar_9812d_00006 | RUNNING | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | | | |
| train_cifar_9812d_00007 | RUNNING | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | | | |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | | | |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
| train_cifar_9812d_00009 | TERMINATED | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | 2.31116 | 0.099 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
(func pid=16426) [2, 2000] loss: 1.356
(func pid=16435) [1, 8000] loss: 0.455
(func pid=16431) [2, 2000] loss: 1.418
== Status ==
Current time: 2023-05-06 11:42:57 (running for 00:00:48.81)
Memory usage on this node: 16.8/62.7 GiB
Using AsyncHyperBand: num_stopped=3
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: -1.3616330410957336 | Iter 1.000: -1.564001158285141
Resources requested: 14.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (7 RUNNING, 3 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.36163 | 0.5157 | 2 |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.43203 | 0.4817 | 1 |
| train_cifar_9812d_00004 | RUNNING | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | 1.564 | 0.4286 | 1 |
| train_cifar_9812d_00005 | RUNNING | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.49091 | 0.4492 | 1 |
| train_cifar_9812d_00006 | RUNNING | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | | | |
| train_cifar_9812d_00007 | RUNNING | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | | | |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | | | |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
| train_cifar_9812d_00009 | TERMINATED | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | 2.31116 | 0.099 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
Result for train_cifar_9812d_00003:
accuracy: 0.5369
date: 2023-05-06_11-42-58
done: false
experiment_id: 970215e7451c4886bcb97cffbd6240bf
hostname: 83df70b6be24
iterations_since_restore: 2
loss: 1.318504800415039
node_ip: 172.17.0.2
pid: 16426
should_checkpoint: true
time_since_restore: 42.04812407493591
time_this_iter_s: 19.161553859710693
time_total_s: 42.04812407493591
timestamp: 1683340978
timesteps_since_restore: 0
training_iteration: 2
trial_id: 9812d_00003
warmup_time: 0.004241466522216797
(func pid=16433) [1, 10000] loss: 0.466
(func pid=16437) [1, 10000] loss: 0.323
(func pid=16429) [2, 2000] loss: 1.546
Result for train_cifar_9812d_00005:
accuracy: 0.5065
date: 2023-05-06_11-43-00
done: true
experiment_id: d41dafbd385442dcb50ad65850f13970
hostname: 83df70b6be24
iterations_since_restore: 2
loss: 1.3553007165908812
node_ip: 172.17.0.2
pid: 16431
should_checkpoint: true
time_since_restore: 44.66352725028992
time_this_iter_s: 20.223103046417236
time_total_s: 44.66352725028992
timestamp: 1683340980
timesteps_since_restore: 0
training_iteration: 2
trial_id: 9812d_00005
warmup_time: 0.0046710968017578125
(func pid=16435) [1, 10000] loss: 0.357
(func pid=16375) [3, 2000] loss: 1.300
== Status ==
Current time: 2023-05-06 11:43:05 (running for 00:00:57.41)
Memory usage on this node: 16.2/62.7 GiB
Using AsyncHyperBand: num_stopped=4
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: -1.3553007165908812 | Iter 1.000: -1.564001158285141
Resources requested: 12.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (6 RUNNING, 4 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.36163 | 0.5157 | 2 |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.3185 | 0.5369 | 2 |
| train_cifar_9812d_00004 | RUNNING | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | 1.564 | 0.4286 | 1 |
| train_cifar_9812d_00006 | RUNNING | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | | | |
| train_cifar_9812d_00007 | RUNNING | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | | | |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | | | |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
| train_cifar_9812d_00005 | TERMINATED | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.3553 | 0.5065 | 2 |
| train_cifar_9812d_00009 | TERMINATED | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | 2.31116 | 0.099 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
(func pid=16437) [1, 12000] loss: 0.261
(func pid=16433) [1, 12000] loss: 0.388
(func pid=16429) [2, 4000] loss: 0.761
(func pid=16435) [1, 12000] loss: 0.290
Result for train_cifar_9812d_00000:
accuracy: 0.529
date: 2023-05-06_11-43-09
done: false
experiment_id: 4288901cf29e43b8ac8c6a7047b15259
hostname: 83df70b6be24
iterations_since_restore: 3
loss: 1.3073500599861145
node_ip: 172.17.0.2
pid: 16375
should_checkpoint: true
time_since_restore: 57.226059675216675
time_this_iter_s: 17.278873205184937
time_total_s: 57.226059675216675
timestamp: 1683340989
timesteps_since_restore: 0
training_iteration: 3
trial_id: 9812d_00000
warmup_time: 0.00445103645324707
(func pid=16426) [3, 2000] loss: 1.227
(func pid=16433) [1, 14000] loss: 0.333
(func pid=16437) [1, 14000] loss: 0.219
== Status ==
Current time: 2023-05-06 11:43:14 (running for 00:01:06.09)
Memory usage on this node: 16.1/62.7 GiB
Using AsyncHyperBand: num_stopped=4
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: -1.3553007165908812 | Iter 1.000: -1.564001158285141
Resources requested: 12.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (6 RUNNING, 4 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.30735 | 0.529 | 3 |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.3185 | 0.5369 | 2 |
| train_cifar_9812d_00004 | RUNNING | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | 1.564 | 0.4286 | 1 |
| train_cifar_9812d_00006 | RUNNING | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | | | |
| train_cifar_9812d_00007 | RUNNING | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | | | |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | | | |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
| train_cifar_9812d_00005 | TERMINATED | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.3553 | 0.5065 | 2 |
| train_cifar_9812d_00009 | TERMINATED | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | 2.31116 | 0.099 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
(func pid=16435) [1, 14000] loss: 0.246
Result for train_cifar_9812d_00003:
accuracy: 0.5733
date: 2023-05-06_11-43-16
done: false
experiment_id: 970215e7451c4886bcb97cffbd6240bf
hostname: 83df70b6be24
iterations_since_restore: 3
loss: 1.2381216032505036
node_ip: 172.17.0.2
pid: 16426
should_checkpoint: true
time_since_restore: 60.114490270614624
time_this_iter_s: 18.06636619567871
time_total_s: 60.114490270614624
timestamp: 1683340996
timesteps_since_restore: 0
training_iteration: 3
trial_id: 9812d_00003
warmup_time: 0.004241466522216797
Result for train_cifar_9812d_00004:
accuracy: 0.4336
date: 2023-05-06_11-43-16
done: true
experiment_id: 8eb27a117a5845e9afe5fdbd364fa30f
hostname: 83df70b6be24
iterations_since_restore: 2
loss: 1.5959759364843369
node_ip: 172.17.0.2
pid: 16429
should_checkpoint: true
time_since_restore: 60.17239832878113
time_this_iter_s: 27.152026414871216
time_total_s: 60.17239832878113
timestamp: 1683340996
timesteps_since_restore: 0
training_iteration: 2
trial_id: 9812d_00004
warmup_time: 0.004842996597290039
(func pid=16437) [1, 16000] loss: 0.189
(func pid=16433) [1, 16000] loss: 0.291
(func pid=16375) [4, 2000] loss: 1.234
== Status ==
Current time: 2023-05-06 11:43:21 (running for 00:01:13.16)
Memory usage on this node: 15.5/62.7 GiB
Using AsyncHyperBand: num_stopped=5
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: -1.3584668788433074 | Iter 1.000: -1.564001158285141
Resources requested: 10.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (5 RUNNING, 5 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.30735 | 0.529 | 3 |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.23812 | 0.5733 | 3 |
| train_cifar_9812d_00006 | RUNNING | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | | | |
| train_cifar_9812d_00007 | RUNNING | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | | | |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | | | |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
| train_cifar_9812d_00004 | TERMINATED | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | 1.59598 | 0.4336 | 2 |
| train_cifar_9812d_00005 | TERMINATED | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.3553 | 0.5065 | 2 |
| train_cifar_9812d_00009 | TERMINATED | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | 2.31116 | 0.099 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
(func pid=16435) [1, 16000] loss: 0.212
== Status ==
Current time: 2023-05-06 11:43:26 (running for 00:01:18.17)
Memory usage on this node: 15.5/62.7 GiB
Using AsyncHyperBand: num_stopped=5
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: -1.3584668788433074 | Iter 1.000: -1.564001158285141
Resources requested: 10.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (5 RUNNING, 5 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.30735 | 0.529 | 3 |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.23812 | 0.5733 | 3 |
| train_cifar_9812d_00006 | RUNNING | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | | | |
| train_cifar_9812d_00007 | RUNNING | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | | | |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | | | |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
| train_cifar_9812d_00004 | TERMINATED | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | 1.59598 | 0.4336 | 2 |
| train_cifar_9812d_00005 | TERMINATED | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.3553 | 0.5065 | 2 |
| train_cifar_9812d_00009 | TERMINATED | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | 2.31116 | 0.099 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
Result for train_cifar_9812d_00000:
accuracy: 0.5406
date: 2023-05-06_11-43-26
done: false
experiment_id: 4288901cf29e43b8ac8c6a7047b15259
hostname: 83df70b6be24
iterations_since_restore: 4
loss: 1.3072022185325622
node_ip: 172.17.0.2
pid: 16375
should_checkpoint: true
time_since_restore: 74.60453009605408
time_this_iter_s: 17.378470420837402
time_total_s: 74.60453009605408
timestamp: 1683341006
timesteps_since_restore: 0
training_iteration: 4
trial_id: 9812d_00000
warmup_time: 0.00445103645324707
(func pid=16437) [1, 18000] loss: 0.167
(func pid=16433) [1, 18000] loss: 0.259
(func pid=16426) [4, 2000] loss: 1.124
(func pid=16435) [1, 18000] loss: 0.186
== Status ==
Current time: 2023-05-06 11:43:31 (running for 00:01:23.46)
Memory usage on this node: 15.5/62.7 GiB
Using AsyncHyperBand: num_stopped=5
Bracket: Iter 8.000: None | Iter 4.000: -1.3072022185325622 | Iter 2.000: -1.3584668788433074 | Iter 1.000: -1.564001158285141
Resources requested: 10.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (5 RUNNING, 5 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.3072 | 0.5406 | 4 |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.23812 | 0.5733 | 3 |
| train_cifar_9812d_00006 | RUNNING | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | | | |
| train_cifar_9812d_00007 | RUNNING | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | | | |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | | | |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
| train_cifar_9812d_00004 | TERMINATED | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | 1.59598 | 0.4336 | 2 |
| train_cifar_9812d_00005 | TERMINATED | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.3553 | 0.5065 | 2 |
| train_cifar_9812d_00009 | TERMINATED | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | 2.31116 | 0.099 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
(func pid=16437) [1, 20000] loss: 0.146
Result for train_cifar_9812d_00003:
accuracy: 0.5675
date: 2023-05-06_11-43-34
done: false
experiment_id: 970215e7451c4886bcb97cffbd6240bf
hostname: 83df70b6be24
iterations_since_restore: 4
loss: 1.2270443179130555
node_ip: 172.17.0.2
pid: 16426
should_checkpoint: true
time_since_restore: 77.74592351913452
time_this_iter_s: 17.631433248519897
time_total_s: 77.74592351913452
timestamp: 1683341014
timesteps_since_restore: 0
training_iteration: 4
trial_id: 9812d_00003
warmup_time: 0.004241466522216797
(func pid=16433) [1, 20000] loss: 0.233
(func pid=16435) [1, 20000] loss: 0.173
== Status ==
Current time: 2023-05-06 11:43:39 (running for 00:01:30.70)
Memory usage on this node: 15.5/62.7 GiB
Using AsyncHyperBand: num_stopped=5
Bracket: Iter 8.000: None | Iter 4.000: -1.2671232682228089 | Iter 2.000: -1.3584668788433074 | Iter 1.000: -1.564001158285141
Resources requested: 10.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (5 RUNNING, 5 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.3072 | 0.5406 | 4 |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.22704 | 0.5675 | 4 |
| train_cifar_9812d_00006 | RUNNING | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | | | |
| train_cifar_9812d_00007 | RUNNING | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | | | |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | | | |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
| train_cifar_9812d_00004 | TERMINATED | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | 1.59598 | 0.4336 | 2 |
| train_cifar_9812d_00005 | TERMINATED | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.3553 | 0.5065 | 2 |
| train_cifar_9812d_00009 | TERMINATED | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | 2.31116 | 0.099 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
(func pid=16375) [5, 2000] loss: 1.180
== Status ==
Current time: 2023-05-06 11:43:44 (running for 00:01:35.71)
Memory usage on this node: 15.5/62.7 GiB
Using AsyncHyperBand: num_stopped=5
Bracket: Iter 8.000: None | Iter 4.000: -1.2671232682228089 | Iter 2.000: -1.3584668788433074 | Iter 1.000: -1.564001158285141
Resources requested: 10.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (5 RUNNING, 5 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.3072 | 0.5406 | 4 |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.22704 | 0.5675 | 4 |
| train_cifar_9812d_00006 | RUNNING | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | | | |
| train_cifar_9812d_00007 | RUNNING | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | | | |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | | | |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
| train_cifar_9812d_00004 | TERMINATED | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | 1.59598 | 0.4336 | 2 |
| train_cifar_9812d_00005 | TERMINATED | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.3553 | 0.5065 | 2 |
| train_cifar_9812d_00009 | TERMINATED | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | 2.31116 | 0.099 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
Result for train_cifar_9812d_00000:
accuracy: 0.5676
date: 2023-05-06_11-43-44
done: false
experiment_id: 4288901cf29e43b8ac8c6a7047b15259
hostname: 83df70b6be24
iterations_since_restore: 5
loss: 1.2511685856342316
node_ip: 172.17.0.2
pid: 16375
should_checkpoint: true
time_since_restore: 92.19643640518188
time_this_iter_s: 17.591906309127808
time_total_s: 92.19643640518188
timestamp: 1683341024
timesteps_since_restore: 0
training_iteration: 5
trial_id: 9812d_00000
warmup_time: 0.00445103645324707
Result for train_cifar_9812d_00008:
accuracy: 0.4996
date: 2023-05-06_11-43-44
done: false
experiment_id: 618bb6e0b05040cab4eceb67166581e1
hostname: 83df70b6be24
iterations_since_restore: 1
loss: 1.4013006913859398
node_ip: 172.17.0.2
pid: 16437
should_checkpoint: true
time_since_restore: 88.10317134857178
time_this_iter_s: 88.10317134857178
time_total_s: 88.10317134857178
timestamp: 1683341024
timesteps_since_restore: 0
training_iteration: 1
trial_id: 9812d_00008
warmup_time: 0.0038585662841796875
Result for train_cifar_9812d_00006:
accuracy: 0.0995
date: 2023-05-06_11-43-44
done: true
experiment_id: 77d5a539506241a49548fe6ea5bbcd25
hostname: 83df70b6be24
iterations_since_restore: 1
loss: 2.32231702170372
node_ip: 172.17.0.2
pid: 16433
should_checkpoint: true
time_since_restore: 88.51138973236084
time_this_iter_s: 88.51138973236084
time_total_s: 88.51138973236084
timestamp: 1683341024
timesteps_since_restore: 0
training_iteration: 1
trial_id: 9812d_00006
warmup_time: 0.0042417049407958984
(func pid=16426) [5, 2000] loss: 1.078
Result for train_cifar_9812d_00007:
accuracy: 0.3814
date: 2023-05-06_11-43-49
done: true
experiment_id: 4699f3a8af0943fdaee12a99c34bc818
hostname: 83df70b6be24
iterations_since_restore: 1
loss: 1.7009081659432965
node_ip: 172.17.0.2
pid: 16435
should_checkpoint: true
time_since_restore: 92.76529359817505
time_this_iter_s: 92.76529359817505
time_total_s: 92.76529359817505
timestamp: 1683341029
timesteps_since_restore: 0
training_iteration: 1
trial_id: 9812d_00007
warmup_time: 0.00432586669921875
== Status ==
Current time: 2023-05-06 11:43:49 (running for 00:01:40.83)
Memory usage on this node: 14.8/62.7 GiB
Using AsyncHyperBand: num_stopped=7
Bracket: Iter 8.000: None | Iter 4.000: -1.2671232682228089 | Iter 2.000: -1.3584668788433074 | Iter 1.000: -1.6324546621142186
Resources requested: 8.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (4 RUNNING, 6 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.25117 | 0.5676 | 5 |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.22704 | 0.5675 | 4 |
| train_cifar_9812d_00007 | RUNNING | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | 1.70091 | 0.3814 | 1 |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | 1.4013 | 0.4996 | 1 |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
| train_cifar_9812d_00004 | TERMINATED | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | 1.59598 | 0.4336 | 2 |
| train_cifar_9812d_00005 | TERMINATED | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.3553 | 0.5065 | 2 |
| train_cifar_9812d_00006 | TERMINATED | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | 2.32232 | 0.0995 | 1 |
| train_cifar_9812d_00009 | TERMINATED | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | 2.31116 | 0.099 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
(func pid=16437) [2, 2000] loss: 1.436
Result for train_cifar_9812d_00003:
accuracy: 0.588
date: 2023-05-06_11-43-51
done: false
experiment_id: 970215e7451c4886bcb97cffbd6240bf
hostname: 83df70b6be24
iterations_since_restore: 5
loss: 1.219955228471756
node_ip: 172.17.0.2
pid: 16426
should_checkpoint: true
time_since_restore: 95.55872774124146
time_this_iter_s: 17.812804222106934
time_total_s: 95.55872774124146
timestamp: 1683341031
timesteps_since_restore: 0
training_iteration: 5
trial_id: 9812d_00003
warmup_time: 0.004241466522216797
(func pid=16375) [6, 2000] loss: 1.141
== Status ==
Current time: 2023-05-06 11:43:56 (running for 00:01:48.53)
Memory usage on this node: 14.2/62.7 GiB
Using AsyncHyperBand: num_stopped=7
Bracket: Iter 8.000: None | Iter 4.000: -1.2671232682228089 | Iter 2.000: -1.3584668788433074 | Iter 1.000: -1.6324546621142186
Resources requested: 6.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (3 RUNNING, 7 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.25117 | 0.5676 | 5 |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.21996 | 0.588 | 5 |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | 1.4013 | 0.4996 | 1 |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
| train_cifar_9812d_00004 | TERMINATED | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | 1.59598 | 0.4336 | 2 |
| train_cifar_9812d_00005 | TERMINATED | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.3553 | 0.5065 | 2 |
| train_cifar_9812d_00006 | TERMINATED | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | 2.32232 | 0.0995 | 1 |
| train_cifar_9812d_00007 | TERMINATED | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | 1.70091 | 0.3814 | 1 |
| train_cifar_9812d_00009 | TERMINATED | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | 2.31116 | 0.099 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
(func pid=16437) [2, 4000] loss: 0.697
Result for train_cifar_9812d_00000:
accuracy: 0.5768
date: 2023-05-06_11-44-01
done: false
experiment_id: 4288901cf29e43b8ac8c6a7047b15259
hostname: 83df70b6be24
iterations_since_restore: 6
loss: 1.195042846918106
node_ip: 172.17.0.2
pid: 16375
should_checkpoint: true
time_since_restore: 109.25635170936584
time_this_iter_s: 17.05991530418396
time_total_s: 109.25635170936584
timestamp: 1683341041
timesteps_since_restore: 0
training_iteration: 6
trial_id: 9812d_00000
warmup_time: 0.00445103645324707
(func pid=16426) [6, 2000] loss: 1.022
(func pid=16437) [2, 6000] loss: 0.467
== Status ==
Current time: 2023-05-06 11:44:06 (running for 00:01:58.10)
Memory usage on this node: 14.2/62.7 GiB
Using AsyncHyperBand: num_stopped=7
Bracket: Iter 8.000: None | Iter 4.000: -1.2671232682228089 | Iter 2.000: -1.3584668788433074 | Iter 1.000: -1.6324546621142186
Resources requested: 6.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (3 RUNNING, 7 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.19504 | 0.5768 | 6 |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.21996 | 0.588 | 5 |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | 1.4013 | 0.4996 | 1 |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
| train_cifar_9812d_00004 | TERMINATED | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | 1.59598 | 0.4336 | 2 |
| train_cifar_9812d_00005 | TERMINATED | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.3553 | 0.5065 | 2 |
| train_cifar_9812d_00006 | TERMINATED | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | 2.32232 | 0.0995 | 1 |
| train_cifar_9812d_00007 | TERMINATED | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | 1.70091 | 0.3814 | 1 |
| train_cifar_9812d_00009 | TERMINATED | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | 2.31116 | 0.099 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
Result for train_cifar_9812d_00003:
accuracy: 0.5908
date: 2023-05-06_11-44-08
done: false
experiment_id: 970215e7451c4886bcb97cffbd6240bf
hostname: 83df70b6be24
iterations_since_restore: 6
loss: 1.2207209486484527
node_ip: 172.17.0.2
pid: 16426
should_checkpoint: true
time_since_restore: 112.67771744728088
time_this_iter_s: 17.11898970603943
time_total_s: 112.67771744728088
timestamp: 1683341048
timesteps_since_restore: 0
training_iteration: 6
trial_id: 9812d_00003
warmup_time: 0.004241466522216797
(func pid=16437) [2, 8000] loss: 0.343
(func pid=16375) [7, 2000] loss: 1.106
== Status ==
Current time: 2023-05-06 11:44:13 (running for 00:02:05.64)
Memory usage on this node: 14.2/62.7 GiB
Using AsyncHyperBand: num_stopped=7
Bracket: Iter 8.000: None | Iter 4.000: -1.2671232682228089 | Iter 2.000: -1.3584668788433074 | Iter 1.000: -1.6324546621142186
Resources requested: 6.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (3 RUNNING, 7 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.19504 | 0.5768 | 6 |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.22072 | 0.5908 | 6 |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | 1.4013 | 0.4996 | 1 |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
| train_cifar_9812d_00004 | TERMINATED | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | 1.59598 | 0.4336 | 2 |
| train_cifar_9812d_00005 | TERMINATED | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.3553 | 0.5065 | 2 |
| train_cifar_9812d_00006 | TERMINATED | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | 2.32232 | 0.0995 | 1 |
| train_cifar_9812d_00007 | TERMINATED | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | 1.70091 | 0.3814 | 1 |
| train_cifar_9812d_00009 | TERMINATED | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | 2.31116 | 0.099 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
Result for train_cifar_9812d_00000:
accuracy: 0.5922
date: 2023-05-06_11-44-18
done: false
experiment_id: 4288901cf29e43b8ac8c6a7047b15259
hostname: 83df70b6be24
iterations_since_restore: 7
loss: 1.1649671128749848
node_ip: 172.17.0.2
pid: 16375
should_checkpoint: true
time_since_restore: 125.83338117599487
time_this_iter_s: 16.57702946662903
time_total_s: 125.83338117599487
timestamp: 1683341058
timesteps_since_restore: 0
training_iteration: 7
trial_id: 9812d_00000
warmup_time: 0.00445103645324707
(func pid=16437) [2, 10000] loss: 0.275
(func pid=16426) [7, 2000] loss: 0.984
== Status ==
Current time: 2023-05-06 11:44:23 (running for 00:02:14.69)
Memory usage on this node: 14.2/62.7 GiB
Using AsyncHyperBand: num_stopped=7
Bracket: Iter 8.000: None | Iter 4.000: -1.2671232682228089 | Iter 2.000: -1.3584668788433074 | Iter 1.000: -1.6324546621142186
Resources requested: 6.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (3 RUNNING, 7 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.16497 | 0.5922 | 7 |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.22072 | 0.5908 | 6 |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | 1.4013 | 0.4996 | 1 |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
| train_cifar_9812d_00004 | TERMINATED | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | 1.59598 | 0.4336 | 2 |
| train_cifar_9812d_00005 | TERMINATED | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.3553 | 0.5065 | 2 |
| train_cifar_9812d_00006 | TERMINATED | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | 2.32232 | 0.0995 | 1 |
| train_cifar_9812d_00007 | TERMINATED | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | 1.70091 | 0.3814 | 1 |
| train_cifar_9812d_00009 | TERMINATED | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | 2.31116 | 0.099 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
(func pid=16437) [2, 12000] loss: 0.226
Result for train_cifar_9812d_00003:
accuracy: 0.5929
date: 2023-05-06_11-44-26
done: false
experiment_id: 970215e7451c4886bcb97cffbd6240bf
hostname: 83df70b6be24
iterations_since_restore: 7
loss: 1.2076413149356842
node_ip: 172.17.0.2
pid: 16426
should_checkpoint: true
time_since_restore: 130.16873908042908
time_this_iter_s: 17.491021633148193
time_total_s: 130.16873908042908
timestamp: 1683341066
timesteps_since_restore: 0
training_iteration: 7
trial_id: 9812d_00003
warmup_time: 0.004241466522216797
(func pid=16375) [8, 2000] loss: 1.093
== Status ==
Current time: 2023-05-06 11:44:31 (running for 00:02:23.12)
Memory usage on this node: 14.2/62.7 GiB
Using AsyncHyperBand: num_stopped=7
Bracket: Iter 8.000: None | Iter 4.000: -1.2671232682228089 | Iter 2.000: -1.3584668788433074 | Iter 1.000: -1.6324546621142186
Resources requested: 6.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (3 RUNNING, 7 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.16497 | 0.5922 | 7 |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.20764 | 0.5929 | 7 |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | 1.4013 | 0.4996 | 1 |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
| train_cifar_9812d_00004 | TERMINATED | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | 1.59598 | 0.4336 | 2 |
| train_cifar_9812d_00005 | TERMINATED | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.3553 | 0.5065 | 2 |
| train_cifar_9812d_00006 | TERMINATED | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | 2.32232 | 0.0995 | 1 |
| train_cifar_9812d_00007 | TERMINATED | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | 1.70091 | 0.3814 | 1 |
| train_cifar_9812d_00009 | TERMINATED | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | 2.31116 | 0.099 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
(func pid=16437) [2, 14000] loss: 0.196
Result for train_cifar_9812d_00000:
accuracy: 0.5933
date: 2023-05-06_11-44-35
done: false
experiment_id: 4288901cf29e43b8ac8c6a7047b15259
hostname: 83df70b6be24
iterations_since_restore: 8
loss: 1.1752527618408204
node_ip: 172.17.0.2
pid: 16375
should_checkpoint: true
time_since_restore: 143.7847933769226
time_this_iter_s: 17.951412200927734
time_total_s: 143.7847933769226
timestamp: 1683341075
timesteps_since_restore: 0
training_iteration: 8
trial_id: 9812d_00000
warmup_time: 0.00445103645324707
(func pid=16426) [8, 2000] loss: 0.958
(func pid=16437) [2, 16000] loss: 0.168
== Status ==
Current time: 2023-05-06 11:44:40 (running for 00:02:32.64)
Memory usage on this node: 14.2/62.7 GiB
Using AsyncHyperBand: num_stopped=7
Bracket: Iter 8.000: -1.1752527618408204 | Iter 4.000: -1.2671232682228089 | Iter 2.000: -1.3584668788433074 | Iter 1.000: -1.6324546621142186
Resources requested: 6.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (3 RUNNING, 7 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.17525 | 0.5933 | 8 |
| train_cifar_9812d_00003 | RUNNING | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.20764 | 0.5929 | 7 |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | 1.4013 | 0.4996 | 1 |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
| train_cifar_9812d_00004 | TERMINATED | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | 1.59598 | 0.4336 | 2 |
| train_cifar_9812d_00005 | TERMINATED | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.3553 | 0.5065 | 2 |
| train_cifar_9812d_00006 | TERMINATED | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | 2.32232 | 0.0995 | 1 |
| train_cifar_9812d_00007 | TERMINATED | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | 1.70091 | 0.3814 | 1 |
| train_cifar_9812d_00009 | TERMINATED | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | 2.31116 | 0.099 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
Result for train_cifar_9812d_00003:
accuracy: 0.5822
date: 2023-05-06_11-44-43
done: true
experiment_id: 970215e7451c4886bcb97cffbd6240bf
hostname: 83df70b6be24
iterations_since_restore: 8
loss: 1.2155846611976624
node_ip: 172.17.0.2
pid: 16426
should_checkpoint: true
time_since_restore: 147.44374871253967
time_this_iter_s: 17.275009632110596
time_total_s: 147.44374871253967
timestamp: 1683341083
timesteps_since_restore: 0
training_iteration: 8
trial_id: 9812d_00003
warmup_time: 0.004241466522216797
(func pid=16437) [2, 18000] loss: 0.147
(func pid=16375) [9, 2000] loss: 1.070
== Status ==
Current time: 2023-05-06 11:44:48 (running for 00:02:40.41)
Memory usage on this node: 13.6/62.7 GiB
Using AsyncHyperBand: num_stopped=8
Bracket: Iter 8.000: -1.1954187115192414 | Iter 4.000: -1.2671232682228089 | Iter 2.000: -1.3584668788433074 | Iter 1.000: -1.6324546621142186
Resources requested: 4.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (2 RUNNING, 8 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.17525 | 0.5933 | 8 |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | 1.4013 | 0.4996 | 1 |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
| train_cifar_9812d_00003 | TERMINATED | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.21558 | 0.5822 | 8 |
| train_cifar_9812d_00004 | TERMINATED | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | 1.59598 | 0.4336 | 2 |
| train_cifar_9812d_00005 | TERMINATED | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.3553 | 0.5065 | 2 |
| train_cifar_9812d_00006 | TERMINATED | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | 2.32232 | 0.0995 | 1 |
| train_cifar_9812d_00007 | TERMINATED | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | 1.70091 | 0.3814 | 1 |
| train_cifar_9812d_00009 | TERMINATED | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | 2.31116 | 0.099 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
(func pid=16437) [2, 20000] loss: 0.133
Result for train_cifar_9812d_00000:
accuracy: 0.6074
date: 2023-05-06_11-44-52
done: false
experiment_id: 4288901cf29e43b8ac8c6a7047b15259
hostname: 83df70b6be24
iterations_since_restore: 9
loss: 1.1338796653270722
node_ip: 172.17.0.2
pid: 16375
should_checkpoint: true
time_since_restore: 160.29088473320007
time_this_iter_s: 16.506091356277466
time_total_s: 160.29088473320007
timestamp: 1683341092
timesteps_since_restore: 0
training_iteration: 9
trial_id: 9812d_00000
warmup_time: 0.00445103645324707
== Status ==
Current time: 2023-05-06 11:44:57 (running for 00:02:49.14)
Memory usage on this node: 13.5/62.7 GiB
Using AsyncHyperBand: num_stopped=8
Bracket: Iter 8.000: -1.1954187115192414 | Iter 4.000: -1.2671232682228089 | Iter 2.000: -1.3584668788433074 | Iter 1.000: -1.6324546621142186
Resources requested: 4.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (2 RUNNING, 8 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.13388 | 0.6074 | 9 |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | 1.4013 | 0.4996 | 1 |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
| train_cifar_9812d_00003 | TERMINATED | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.21558 | 0.5822 | 8 |
| train_cifar_9812d_00004 | TERMINATED | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | 1.59598 | 0.4336 | 2 |
| train_cifar_9812d_00005 | TERMINATED | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.3553 | 0.5065 | 2 |
| train_cifar_9812d_00006 | TERMINATED | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | 2.32232 | 0.0995 | 1 |
| train_cifar_9812d_00007 | TERMINATED | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | 1.70091 | 0.3814 | 1 |
| train_cifar_9812d_00009 | TERMINATED | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | 2.31116 | 0.099 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
== Status ==
Current time: 2023-05-06 11:45:02 (running for 00:02:54.16)
Memory usage on this node: 13.5/62.7 GiB
Using AsyncHyperBand: num_stopped=8
Bracket: Iter 8.000: -1.1954187115192414 | Iter 4.000: -1.2671232682228089 | Iter 2.000: -1.3584668788433074 | Iter 1.000: -1.6324546621142186
Resources requested: 4.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (2 RUNNING, 8 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.13388 | 0.6074 | 9 |
| train_cifar_9812d_00008 | RUNNING | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | 1.4013 | 0.4996 | 1 |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
| train_cifar_9812d_00003 | TERMINATED | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.21558 | 0.5822 | 8 |
| train_cifar_9812d_00004 | TERMINATED | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | 1.59598 | 0.4336 | 2 |
| train_cifar_9812d_00005 | TERMINATED | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.3553 | 0.5065 | 2 |
| train_cifar_9812d_00006 | TERMINATED | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | 2.32232 | 0.0995 | 1 |
| train_cifar_9812d_00007 | TERMINATED | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | 1.70091 | 0.3814 | 1 |
| train_cifar_9812d_00009 | TERMINATED | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | 2.31116 | 0.099 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
Result for train_cifar_9812d_00008:
accuracy: 0.5167
date: 2023-05-06_11-45-03
done: true
experiment_id: 618bb6e0b05040cab4eceb67166581e1
hostname: 83df70b6be24
iterations_since_restore: 2
loss: 1.362510330351768
node_ip: 172.17.0.2
pid: 16437
should_checkpoint: true
time_since_restore: 166.71913599967957
time_this_iter_s: 78.61596465110779
time_total_s: 166.71913599967957
timestamp: 1683341103
timesteps_since_restore: 0
training_iteration: 2
trial_id: 9812d_00008
warmup_time: 0.0038585662841796875
(func pid=16375) [10, 2000] loss: 1.056
== Status ==
Current time: 2023-05-06 11:45:08 (running for 00:02:59.80)
Memory usage on this node: 12.9/62.7 GiB
Using AsyncHyperBand: num_stopped=9
Bracket: Iter 8.000: -1.1954187115192414 | Iter 4.000: -1.2671232682228089 | Iter 2.000: -1.3616330410957336 | Iter 1.000: -1.6324546621142186
Resources requested: 2.0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (1 RUNNING, 9 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | RUNNING | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.13388 | 0.6074 | 9 |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
| train_cifar_9812d_00003 | TERMINATED | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.21558 | 0.5822 | 8 |
| train_cifar_9812d_00004 | TERMINATED | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | 1.59598 | 0.4336 | 2 |
| train_cifar_9812d_00005 | TERMINATED | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.3553 | 0.5065 | 2 |
| train_cifar_9812d_00006 | TERMINATED | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | 2.32232 | 0.0995 | 1 |
| train_cifar_9812d_00007 | TERMINATED | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | 1.70091 | 0.3814 | 1 |
| train_cifar_9812d_00008 | TERMINATED | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | 1.36251 | 0.5167 | 2 |
| train_cifar_9812d_00009 | TERMINATED | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | 2.31116 | 0.099 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
Result for train_cifar_9812d_00000:
accuracy: 0.5944
date: 2023-05-06_11-45-08
done: true
experiment_id: 4288901cf29e43b8ac8c6a7047b15259
hostname: 83df70b6be24
iterations_since_restore: 10
loss: 1.2023105457782746
node_ip: 172.17.0.2
pid: 16375
should_checkpoint: true
time_since_restore: 176.60977840423584
time_this_iter_s: 16.318893671035767
time_total_s: 176.60977840423584
timestamp: 1683341108
timesteps_since_restore: 0
training_iteration: 10
trial_id: 9812d_00000
warmup_time: 0.00445103645324707
== Status ==
Current time: 2023-05-06 11:45:08 (running for 00:03:00.47)
Memory usage on this node: 12.8/62.7 GiB
Using AsyncHyperBand: num_stopped=10
Bracket: Iter 8.000: -1.1954187115192414 | Iter 4.000: -1.2671232682228089 | Iter 2.000: -1.3616330410957336 | Iter 1.000: -1.6324546621142186
Resources requested: 0/32 CPUs, 0/2 GPUs, 0.0/37.69 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/train_cifar_2023-05-06_11-42-08
Number of trials: 10/10 (10 TERMINATED)
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_9812d_00000 | TERMINATED | 172.17.0.2:16375 | 16 | 16 | 64 | 0.0033252 | 1.20231 | 0.5944 | 10 |
| train_cifar_9812d_00001 | TERMINATED | 172.17.0.2:16423 | 16 | 256 | 4 | 0.000744654 | 2.10313 | 0.227 | 1 |
| train_cifar_9812d_00002 | TERMINATED | 172.17.0.2:16425 | 16 | 32 | 16 | 0.000578039 | 2.06587 | 0.2193 | 1 |
| train_cifar_9812d_00003 | TERMINATED | 172.17.0.2:16426 | 16 | 64 | 256 | 0.00487814 | 1.21558 | 0.5822 | 8 |
| train_cifar_9812d_00004 | TERMINATED | 172.17.0.2:16429 | 8 | 8 | 256 | 0.00547006 | 1.59598 | 0.4336 | 2 |
| train_cifar_9812d_00005 | TERMINATED | 172.17.0.2:16431 | 16 | 256 | 32 | 0.00313045 | 1.3553 | 0.5065 | 2 |
| train_cifar_9812d_00006 | TERMINATED | 172.17.0.2:16433 | 2 | 16 | 32 | 0.0225081 | 2.32232 | 0.0995 | 1 |
| train_cifar_9812d_00007 | TERMINATED | 172.17.0.2:16435 | 2 | 256 | 8 | 0.00253129 | 1.70091 | 0.3814 | 1 |
| train_cifar_9812d_00008 | TERMINATED | 172.17.0.2:16437 | 2 | 64 | 128 | 0.00101019 | 1.36251 | 0.5167 | 2 |
| train_cifar_9812d_00009 | TERMINATED | 172.17.0.2:16439 | 8 | 64 | 4 | 0.000166221 | 2.31116 | 0.099 | 1 |
+-------------------------+------------+------------------+--------------+------+------+-------------+---------+------------+----------------------+
2023-05-06 11:45:08,918 INFO tune.py:747 -- Total run time: 180.82 seconds (180.46 seconds for the tuning loop).
Best trial config: {'l1': 16, 'l2': 64, 'lr': 0.0033252005833185982, 'batch_size': 16}
Best trial final validation loss: 1.2023105457782746
Best trial final validation accuracy: 0.5944
Files already downloaded and verified
Files already downloaded and verified
Best trial test set accuracy: 0.5993
If you run the code, an example output could look like this:
Number of trials: 10 (10 TERMINATED)
+-----+------+------+-------------+--------------+---------+------------+--------------------+
| ... | l1 | l2 | lr | batch_size | loss | accuracy | training_iteration |
|-----+------+------+-------------+--------------+---------+------------+--------------------|
| ... | 64 | 4 | 0.00011629 | 2 | 1.87273 | 0.244 | 2 |
| ... | 32 | 64 | 0.000339763 | 8 | 1.23603 | 0.567 | 8 |
| ... | 8 | 16 | 0.00276249 | 16 | 1.1815 | 0.5836 | 10 |
| ... | 4 | 64 | 0.000648721 | 4 | 1.31131 | 0.5224 | 8 |
| ... | 32 | 16 | 0.000340753 | 8 | 1.26454 | 0.5444 | 8 |
| ... | 8 | 4 | 0.000699775 | 8 | 1.99594 | 0.1983 | 2 |
| ... | 256 | 8 | 0.0839654 | 16 | 2.3119 | 0.0993 | 1 |
| ... | 16 | 128 | 0.0758154 | 16 | 2.33575 | 0.1327 | 1 |
| ... | 16 | 8 | 0.0763312 | 16 | 2.31129 | 0.1042 | 4 |
| ... | 128 | 16 | 0.000124903 | 4 | 2.26917 | 0.1945 | 1 |
+-----+------+------+-------------+--------------+---------+------------+--------------------+
Best trial config: {'l1': 8, 'l2': 16, 'lr': 0.00276249, 'batch_size': 16, 'data_dir': '...'}
Best trial final validation loss: 1.181501
Best trial final validation accuracy: 0.5836
Best trial test set accuracy: 0.5806
Most trials were stopped early in order to avoid wasting resources. The best-performing trial achieved a validation accuracy of about 58%, which could be confirmed on the test set.
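To make that last step concrete, here is a minimal sketch of how the best trial's checkpoint might be restored and evaluated on the CIFAR10 test split. It assumes result is the analysis object returned by tune.run, that Net and load_data are the definitions from earlier in this tutorial, and that trial.checkpoint.value points at the directory the training function saved its checkpoint into (true for the Ray version used here, but worth verifying against your installed release):

import os

import torch
from torch.utils.data import DataLoader

# Pick the trial with the lowest final validation loss.
best_trial = result.get_best_trial("loss", "min", "last")
print("Best trial config:", best_trial.config)

# Rebuild the network with the best trial's layer sizes.
best_trained_model = Net(best_trial.config["l1"], best_trial.config["l2"])
device = "cuda:0" if torch.cuda.is_available() else "cpu"
best_trained_model.to(device)

# The checkpoint file holds (model_state, optimizer_state), matching how
# train_cifar saved it. checkpoint.value is an assumption for this Ray version.
model_state, optimizer_state = torch.load(
    os.path.join(best_trial.checkpoint.value, "checkpoint"))
best_trained_model.load_state_dict(model_state)

# Evaluate on the held-out CIFAR10 test split.
_, testset = load_data()
testloader = DataLoader(testset, batch_size=4, shuffle=False)

correct, total = 0, 0
with torch.no_grad():
    for images, labels in testloader:
        images, labels = images.to(device), labels.to(device)
        outputs = best_trained_model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print("Best trial test set accuracy:", correct / total)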
And that's it! You can now tune the parameters of your PyTorch models.
Total running time of the script: ( 3 minutes 30.618 seconds)