
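Hyperparameter tuning with Ray Tune

Korean translation: 심형준

Hyperparameter tuning can make the difference between an average model and a highly accurate one. Often, something as simple as choosing a different learning rate or changing a layer size can have a dramatic impact on model performance.

Fortunately, there are tools that help find the best combination of parameters. Ray Tune is an industry-standard tool for distributed hyperparameter tuning. It includes the latest hyperparameter search algorithms, integrates with TensorBoard and other analysis libraries, and natively supports distributed training through Ray's distributed machine learning engine.

This tutorial shows how to integrate Ray Tune into a PyTorch training workflow. It extends the tutorial from the PyTorch documentation for training a CIFAR10 image classifier.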

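Only a few modifications are needed. In particular, we need to:

  1. wrap data loading and training in functions,

  2. make some network parameters configurable,

  3. add checkpointing (optional),

  4. and define the search space for model tuning.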


To run this tutorial, make sure the following packages are installed:

  • ray[tune]: distributed hyperparameter tuning library

  • torchvision: needed for the data transforms

Setup / Imports

Let's start with the imports:

from functools import partial
import numpy as np
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import random_split
import torchvision
import torchvision.transforms as transforms
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler

Most of the imports are needed to build the PyTorch model. Only the last three are for Ray Tune.

Data loaders

We wrap the data loaders in their own function and pass in a global data directory. This way, different trials can share the same data directory.

def load_data(data_dir="./data"):
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    trainset = torchvision.datasets.CIFAR10(
        root=data_dir, train=True, download=True, transform=transform)

    testset = torchvision.datasets.CIFAR10(
        root=data_dir, train=False, download=True, transform=transform)

    return trainset, testset

Configurable neural network

Only parameters that are configurable can be tuned. In this example, the sizes of the fully connected layers can be specified:

class Net(nn.Module):
    def __init__(self, l1=120, l2=84):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, l1)
        self.fc2 = nn.Linear(l1, l2)
        self.fc3 = nn.Linear(l2, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
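
The 16 * 5 * 5 input size of fc1 follows from the CIFAR10 image size and the layer shapes above. As a rough check, the sketch below traces the spatial size through the two convolution and pooling stages. It assumes 32x32 inputs, no padding, and stride 1 for the convolutions; the helper name conv_out is made up for illustration.

```python
def conv_out(size, kernel, stride=1):
    """Output width of an unpadded convolution or pooling layer."""
    return (size - kernel) // stride + 1


size = 32                                  # CIFAR10 images are 32x32
size = conv_out(size, kernel=5)            # conv1: 32 -> 28
size = conv_out(size, kernel=2, stride=2)  # pool:  28 -> 14
size = conv_out(size, kernel=5)            # conv2: 14 -> 10
size = conv_out(size, kernel=2, stride=2)  # pool:  10 -> 5

flat_features = 16 * size * size           # conv2 produces 16 channels
print(flat_features)                       # 400
```

Only l1 and l2 are free to vary here; the 400 input features and the 10 output classes are fixed by the data and the convolutional layers.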

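The train function

Now it gets interesting, because we introduce some changes to the example from the PyTorch documentation.

We wrap the training script in a function train_cifar(config, checkpoint_dir=None, data_dir=None). As you can guess, the config parameter receives the hyperparameters to train with. The checkpoint_dir parameter is used to restore checkpoints. The data_dir parameter specifies the directory where the data is loaded from and stored, so that multiple runs can share the same data source.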

net = Net(config["l1"], config["l2"])

if checkpoint_dir:
    model_state, optimizer_state = torch.load(
        os.path.join(checkpoint_dir, "checkpoint"))
    net.load_state_dict(model_state)
    optimizer.load_state_dict(optimizer_state)

The learning rate of the optimizer is made configurable, too:

optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)

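We also split the training data into a training subset and a validation subset. We thus train on 80% of the data and compute the validation loss and accuracy on the remaining 20%. The batch size used to iterate through the training and validation sets is configurable as well.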
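
The 80/20 split can be sketched on a small stand-in dataset. The TensorDataset with 50 fake samples below is an assumption made for illustration only; the tutorial itself splits the CIFAR10 training set.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical stand-in for the CIFAR10 training set: 50 fake samples.
dataset = TensorDataset(torch.zeros(50, 3), torch.zeros(50, dtype=torch.long))

# 80% for training, the remainder for validation.
n_train = int(len(dataset) * 0.8)
train_subset, val_subset = random_split(
    dataset, [n_train, len(dataset) - n_train])

print(len(train_subset), len(val_subset))  # 40 10
```

With the 50,000 CIFAR10 training images, the same arithmetic yields 40,000 training and 10,000 validation samples.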

Adding (multi) GPU support with DataParallel

Image classification benefits greatly from GPUs. Luckily, we can continue to use PyTorch's abstractions in Ray Tune. Thus, we can wrap the model in nn.DataParallel to support data-parallel training on multiple GPUs:

device = "cpu"
if torch.cuda.is_available():
    device = "cuda:0"
    if torch.cuda.device_count() > 1:
        net = nn.DataParallel(net)
net.to(device)

By using a device variable, we make sure that training also works when no GPUs are available. PyTorch requires us to send the data to GPU memory explicitly, like this:

for i, data in enumerate(trainloader, 0):
    inputs, labels = data
    inputs, labels = inputs.to(device), labels.to(device)

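The code now supports training on CPUs, on a single GPU, and on multiple GPUs. Notably, Ray also supports fractional GPUs, so we can share GPUs among trials as long as the model still fits in GPU memory. We'll come back to that later.

Communicating with Ray Tune

The most interesting part is the communication with Ray Tune: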

with tune.checkpoint_dir(epoch) as checkpoint_dir:
    path = os.path.join(checkpoint_dir, "checkpoint")
    torch.save((net.state_dict(), optimizer.state_dict()), path)

tune.report(loss=(val_loss / val_steps), accuracy=correct / total)

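Here we first save a checkpoint and then report some metrics back to Ray Tune. Specifically, we send the validation loss and accuracy. Ray Tune can then use these metrics to decide which hyperparameter configuration leads to the best results. The same metrics can also be used to stop badly performing trials early, so that no resources are wasted on them.

Saving the checkpoint is optional, but it is necessary if we want to use advanced schedulers such as Population Based Training. Saving the checkpoint also allows us to load the trained model later and validate it on a test set.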
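
The save-and-restore pattern can be tried in isolation, without Ray Tune. The sketch below is a minimal round trip under stated assumptions: a tiny nn.Linear model stands in for Net, and a temporary directory stands in for the checkpoint directory that Ray Tune would provide.

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim

# Hypothetical small model in place of the tutorial's Net.
net = nn.Linear(4, 2)
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

with tempfile.TemporaryDirectory() as checkpoint_dir:
    path = os.path.join(checkpoint_dir, "checkpoint")
    # Save model and optimizer state together, as in the tutorial.
    torch.save((net.state_dict(), optimizer.state_dict()), path)

    # Restore into freshly constructed objects of the same shape.
    restored_net = nn.Linear(4, 2)
    restored_optimizer = optim.SGD(
        restored_net.parameters(), lr=0.01, momentum=0.9)
    model_state, optimizer_state = torch.load(path)
    restored_net.load_state_dict(model_state)
    restored_optimizer.load_state_dict(optimizer_state)

weights_match = torch.equal(net.weight, restored_net.weight)
print(weights_match)
```

Storing the optimizer state next to the model state matters when training resumes from the checkpoint, because SGD with momentum keeps per-parameter buffers.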

Full training function

The full code example looks like this:

def train_cifar(config, checkpoint_dir=None, data_dir=None):
    net = Net(config["l1"], config["l2"])

    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if torch.cuda.device_count() > 1:
            net = nn.DataParallel(net)
    net.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)

    if checkpoint_dir:
        model_state, optimizer_state = torch.load(
            os.path.join(checkpoint_dir, "checkpoint"))
        net.load_state_dict(model_state)
        optimizer.load_state_dict(optimizer_state)

    trainset, testset = load_data(data_dir)

    test_abs = int(len(trainset) * 0.8)
    train_subset, val_subset = random_split(
        trainset, [test_abs, len(trainset) - test_abs])

    trainloader = torch.utils.data.DataLoader(
        train_subset,
        batch_size=int(config["batch_size"]),
        shuffle=True,
        num_workers=8)
    valloader = torch.utils.data.DataLoader(
        val_subset,
        batch_size=int(config["batch_size"]),
        shuffle=True,
        num_workers=8)

    for epoch in range(10):  # loop over the dataset multiple times
        running_loss = 0.0
        epoch_steps = 0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            epoch_steps += 1
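            if i % 2000 == 1999:  # print every 2000 mini-batches
                # NOTE: epoch_steps is not reset below, so the printed value
                # is the loss summed over the last 2000 batches divided by
                # all steps so far in the epoch, not a true 2000-batch mean.
                print("[%d, %5d] loss: %.3f" % (epoch + 1, i + 1,
                                                running_loss / epoch_steps))
                running_loss = 0.0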

        # Validation loss
        val_loss = 0.0
        val_steps = 0
        total = 0
        correct = 0
        for i, data in enumerate(valloader, 0):
            with torch.no_grad():
                inputs, labels = data
                inputs, labels = inputs.to(device), labels.to(device)

                outputs = net(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

                loss = criterion(outputs, labels)
                val_loss += loss.cpu().numpy()
                val_steps += 1

        with tune.checkpoint_dir(epoch) as checkpoint_dir:
            path = os.path.join(checkpoint_dir, "checkpoint")
            torch.save((net.state_dict(), optimizer.state_dict()), path)

        tune.report(loss=(val_loss / val_steps), accuracy=correct / total)
    print("Finished Training")

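As you can see, most of the code is adapted directly from the original example.

Test set accuracy

Commonly, the performance of a machine learning model is tested on a hold-out test set, that is, data that has not been used for training the model. We wrap this in a function as well: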

def test_accuracy(net, device="cpu"):
    trainset, testset = load_data()

    testloader = torch.utils.data.DataLoader(
        testset, batch_size=4, shuffle=False, num_workers=2)

    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = net(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    return correct / total

The function also takes a device parameter, so the test set evaluation can run on a GPU.

Configuring the search space

Lastly, we need to define Ray Tune's search space. Here is an example:

config = {
    "l1": tune.sample_from(lambda _: 2**np.random.randint(2, 9)),
    "l2": tune.sample_from(lambda _: 2**np.random.randint(2, 9)),
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([2, 4, 8, 16])
}

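The tune.sample_from() function lets you define your own sampling method to obtain hyperparameters. In this example, the l1 and l2 parameters should be powers of 2 between 4 and 256, so one of 4, 8, 16, 32, 64, 128, or 256. The lr (learning rate) is sampled log-uniformly between 0.0001 and 0.1 by tune.loguniform(). Lastly, the batch size is a choice between 2, 4, 8, and 16.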
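
To make the sampling rules concrete, the sketch below reproduces them with the Python standard library only. It is an illustration of the distributions described above, not Ray Tune's implementation; sample_config is a made-up helper name.

```python
import math
import random


def sample_config(rng):
    """Draw one configuration from a search space like the one above."""
    return {
        # Exponents 2..8 inclusive, i.e. 4, 8, ..., 256.
        # (np.random.randint(2, 9) excludes the upper bound 9,
        # while random.randint(2, 8) includes both ends.)
        "l1": 2 ** rng.randint(2, 8),
        "l2": 2 ** rng.randint(2, 8),
        # Log-uniform: sample uniformly in log space, then exponentiate.
        "lr": math.exp(rng.uniform(math.log(1e-4), math.log(1e-1))),
        "batch_size": rng.choice([2, 4, 8, 16]),
    }


rng = random.Random(0)
samples = [sample_config(rng) for _ in range(1000)]

# Share of learning rates in the lowest of the three decades.
low_lr_share = sum(s["lr"] < 1e-3 for s in samples) / len(samples)
print(low_lr_share)
```

Under log-uniform sampling, each decade (1e-4 to 1e-3, 1e-3 to 1e-2, 1e-2 to 1e-1) receives roughly one third of the draws. A plain uniform distribution over the same range would put about 1% of the draws below 1e-3, which is why log-uniform sampling is the usual choice for learning rates.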

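For each trial, Ray Tune now randomly samples a combination of parameters from these search spaces. It then trains a number of models in parallel and finds the best-performing one among them. We also use the ASHAScheduler, which terminates badly performing trials early.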
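
As a rough illustration of how early stopping of this kind works, the sketch below computes the iterations ("rungs") at which trials are compared, using the scheduler settings from this tutorial's main function (grace_period=1, reduction_factor=2, max_t=10). It is a simplified model and not Ray Tune's implementation: both function names are made up, and Ray Tune's actual cutoff rule may differ in detail. The rungs it yields (1, 2, 4, 8) match the Iter 1.000, 2.000, 4.000, and 8.000 entries in the status output shown later.

```python
def asha_rungs(grace_period, reduction_factor, max_t):
    """Iterations at which trials are compared (simplified)."""
    rungs = []
    t = grace_period
    while t <= max_t:
        rungs.append(t)
        t *= reduction_factor
    return rungs


def keep_trial(loss, rung_losses, reduction_factor):
    """Keep a trial if its loss ranks within the best 1/reduction_factor
    of all losses seen so far at this rung (lower is better)."""
    ranked = sorted(rung_losses + [loss])
    cutoff_index = max(len(ranked) // reduction_factor - 1, 0)
    return loss <= ranked[cutoff_index]


print(asha_rungs(grace_period=1, reduction_factor=2, max_t=10))  # [1, 2, 4, 8]

# A trial reporting loss 2.30 at a rung where 2.29 and 1.47 were already seen.
print(keep_trial(2.30, [2.29, 1.47], reduction_factor=2))        # False
```

Because the comparison is asynchronous, a trial is judged against whatever results have already been recorded at its rung, so the first trial to reach a rung always continues.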

We wrap the train_cifar function with functools.partial to set the constant data_dir parameter. We can also tell Ray Tune which resources should be available for each trial:

gpus_per_trial = 2
# ...
result = tune.run(
    partial(train_cifar, data_dir=data_dir),
    resources_per_trial={"cpu": 8, "gpu": gpus_per_trial},
    config=config,
    num_samples=num_samples,
    scheduler=scheduler,
    progress_reporter=reporter,
    checkpoint_at_end=True)
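
The effect of functools.partial can be seen in isolation. In the sketch below, train_stub is a hypothetical stand-in for train_cifar that only echoes its inputs, and the path /tmp/data is an arbitrary example value.

```python
from functools import partial


def train_stub(config, checkpoint_dir=None, data_dir=None):
    """Hypothetical stand-in for train_cifar that only echoes its inputs."""
    return config["lr"], data_dir


# Bind the constant argument once; the result is a new callable.
trainable = partial(train_stub, data_dir="/tmp/data")

# The caller now only needs to supply the per-trial config.
result = trainable({"lr": 0.01})
print(result)  # (0.01, '/tmp/data')
```

This keeps data_dir out of the search space: it is fixed for every trial, while config changes from trial to trial.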

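You can specify the number of CPUs, which are then available, for example, to increase the num_workers of the PyTorch DataLoader instances. The selected number of GPUs is made visible to PyTorch in each trial. Trials do not have access to GPUs that have not been requested for them, so you don't have to worry about two trials using the same set of resources.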

You can also specify fractional GPUs, so something like gpus_per_trial=0.5 is completely valid. The trials will then share GPUs among each other. You just have to make sure that the models still fit in GPU memory.
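
The per-trial resource request also bounds how many trials can run at the same time. The sketch below is a back-of-the-envelope estimate under a simplified resource model, not Ray's scheduler: max_concurrent_trials is a made-up helper, and it ignores that a fractional request must fit on a single physical GPU. The machine size (32 CPUs, 4 GPUs) is taken from the status output shown later in this tutorial.

```python
import math


def max_concurrent_trials(total_cpus, total_gpus,
                          cpus_per_trial, gpus_per_trial):
    """Upper bound on trials that fit at once (simplified resource model)."""
    by_cpu = math.floor(total_cpus / cpus_per_trial)
    if gpus_per_trial == 0:
        return by_cpu  # CPU-only trials are limited by CPUs alone
    by_gpu = math.floor(total_gpus / gpus_per_trial)
    return min(by_cpu, by_gpu)


print(max_concurrent_trials(32, 4, cpus_per_trial=2, gpus_per_trial=0))    # 16
print(max_concurrent_trials(32, 4, cpus_per_trial=2, gpus_per_trial=0.5))  # 8
print(max_concurrent_trials(32, 4, cpus_per_trial=8, gpus_per_trial=2))    # 2
```

In the CPU-only run logged in this tutorial, 10 trials at 2 CPUs each request 20 of the 32 CPUs, so all of them can run at once.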

After training the models, we find the best-performing one and load the trained network from the checkpoint file. We then obtain the test set accuracy and report everything by printing.

The full main function looks like this:

def main(num_samples=10, max_num_epochs=10, gpus_per_trial=2):
    data_dir = os.path.abspath("./data")
    load_data(data_dir)
    config = {
        "l1": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        "l2": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([2, 4, 8, 16])
    }
    scheduler = ASHAScheduler(
        metric="loss",
        mode="min",
        max_t=max_num_epochs,
        grace_period=1,
        reduction_factor=2)
    reporter = CLIReporter(
        # parameter_columns=["l1", "l2", "lr", "batch_size"],
        metric_columns=["loss", "accuracy", "training_iteration"])
    result = tune.run(
        partial(train_cifar, data_dir=data_dir),
        resources_per_trial={"cpu": 2, "gpu": gpus_per_trial},
        config=config,
        num_samples=num_samples,
        scheduler=scheduler,
        progress_reporter=reporter)

    best_trial = result.get_best_trial("loss", "min", "last")
    print("Best trial config: {}".format(best_trial.config))
    print("Best trial final validation loss: {}".format(
        best_trial.last_result["loss"]))
    print("Best trial final validation accuracy: {}".format(
        best_trial.last_result["accuracy"]))

    best_trained_model = Net(best_trial.config["l1"], best_trial.config["l2"])
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if gpus_per_trial > 1:
            best_trained_model = nn.DataParallel(best_trained_model)
    best_trained_model.to(device)

    best_checkpoint_dir = best_trial.checkpoint.value
    model_state, optimizer_state = torch.load(os.path.join(
        best_checkpoint_dir, "checkpoint"))
    best_trained_model.load_state_dict(model_state)

    test_acc = test_accuracy(best_trained_model, device)
    print("Best trial test set accuracy: {}".format(test_acc))


if __name__ == "__main__":
    # You can change the number of GPUs per trial here:
    main(num_samples=10, max_num_epochs=10, gpus_per_trial=0)

Running the script prints output similar to the following:

Files already downloaded and verified
Files already downloaded and verified
2024-03-02 04:43:31,997 WARNING services.py:2002 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2024-03-02 04:43:34,093 WARNING tune.py:668 -- Tune detects GPUs, but no trials are using GPUs. To enable trials to use GPUs, set tune.run(resources_per_trial={'gpu': 1}...) which allows Tune to expose 1 GPU to each trial. You can also override `Trainable.default_resource_request` if using the Trainable API.
2024-03-02 04:43:34,263 ERROR syncer.py:147 -- Log sync requires rsync to be installed.
== Status ==
Current time: 2024-03-02 04:43:34 (running for 00:00:00.17)
Memory usage on this node: 9.2/503.5 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 2.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (9 PENDING, 1 RUNNING)
+-------------------------+----------+-----------------+--------------+------+------+-------------+
| Trial name              | status   | loc             |   batch_size |   l1 |   l2 |          lr |
|-------------------------+----------+-----------------+--------------+------+------+-------------|
| train_cifar_6d290_00000 | RUNNING  | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 |
| train_cifar_6d290_00001 | PENDING  |                 |            2 |   32 |    4 | 0.0222309   |
| train_cifar_6d290_00002 | PENDING  |                 |           16 |   64 |   64 | 0.00828877  |
| train_cifar_6d290_00003 | PENDING  |                 |            4 |  128 |   64 | 0.00306595  |
| train_cifar_6d290_00004 | PENDING  |                 |            2 |    4 |  128 | 0.000424441 |
| train_cifar_6d290_00005 | PENDING  |                 |            4 |  256 |   16 | 0.028402    |
| train_cifar_6d290_00006 | PENDING  |                 |            2 |  256 |    4 | 0.00202904  |
| train_cifar_6d290_00007 | PENDING  |                 |           16 |    4 |   32 | 0.000212462 |
| train_cifar_6d290_00008 | PENDING  |                 |            2 |   16 |   16 | 0.000831999 |
| train_cifar_6d290_00009 | PENDING  |                 |            8 |   16 |   16 | 0.000102461 |
+-------------------------+----------+-----------------+--------------+------+------+-------------+


(func pid=2826) Files already downloaded and verified
(func pid=2826) Files already downloaded and verified
(func pid=2826) [1,  2000] loss: 2.220
(func pid=2872) Files already downloaded and verified
(func pid=2865) Files already downloaded and verified
(func pid=2876) Files already downloaded and verified
(func pid=2874) Files already downloaded and verified
(func pid=2862) Files already downloaded and verified
(func pid=2864) Files already downloaded and verified
(func pid=2868) Files already downloaded and verified
(func pid=2860) Files already downloaded and verified
(func pid=2870) Files already downloaded and verified
== Status ==
Current time: 2024-03-02 04:43:43 (running for 00:00:09.00)
Memory usage on this node: 11.6/503.5 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 20.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (10 RUNNING)
+-------------------------+----------+-----------------+--------------+------+------+-------------+
| Trial name              | status   | loc             |   batch_size |   l1 |   l2 |          lr |
|-------------------------+----------+-----------------+--------------+------+------+-------------|
| train_cifar_6d290_00000 | RUNNING  | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 |
| train_cifar_6d290_00001 | RUNNING  | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   |
| train_cifar_6d290_00002 | RUNNING  | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  |
| train_cifar_6d290_00003 | RUNNING  | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  |
| train_cifar_6d290_00004 | RUNNING  | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 |
| train_cifar_6d290_00005 | RUNNING  | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    |
| train_cifar_6d290_00006 | RUNNING  | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  |
| train_cifar_6d290_00007 | RUNNING  | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 |
| train_cifar_6d290_00008 | RUNNING  | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 |
| train_cifar_6d290_00009 | RUNNING  | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 |
+-------------------------+----------+-----------------+--------------+------+------+-------------+


(func pid=2872) Files already downloaded and verified
(func pid=2865) Files already downloaded and verified
(func pid=2876) Files already downloaded and verified
(func pid=2874) Files already downloaded and verified
(func pid=2862) Files already downloaded and verified
(func pid=2864) Files already downloaded and verified
(func pid=2868) Files already downloaded and verified
(func pid=2860) Files already downloaded and verified
(func pid=2870) Files already downloaded and verified
(func pid=2826) [1,  4000] loss: 1.010
(func pid=2865) [1,  2000] loss: 2.215
(func pid=2874) [1,  2000] loss: 2.262
(func pid=2860) [1,  2000] loss: 2.328
(func pid=2870) [1,  2000] loss: 2.182
(func pid=2864) [1,  2000] loss: 2.124
(func pid=2868) [1,  2000] loss: 2.311
== Status ==
Current time: 2024-03-02 04:43:49 (running for 00:00:15.04)
Memory usage on this node: 15.6/503.5 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 20.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (10 RUNNING)
+-------------------------+----------+-----------------+--------------+------+------+-------------+
| Trial name              | status   | loc             |   batch_size |   l1 |   l2 |          lr |
|-------------------------+----------+-----------------+--------------+------+------+-------------|
| train_cifar_6d290_00000 | RUNNING  | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 |
| train_cifar_6d290_00001 | RUNNING  | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   |
| train_cifar_6d290_00002 | RUNNING  | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  |
| train_cifar_6d290_00003 | RUNNING  | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  |
| train_cifar_6d290_00004 | RUNNING  | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 |
| train_cifar_6d290_00005 | RUNNING  | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    |
| train_cifar_6d290_00006 | RUNNING  | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  |
| train_cifar_6d290_00007 | RUNNING  | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 |
| train_cifar_6d290_00008 | RUNNING  | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 |
| train_cifar_6d290_00009 | RUNNING  | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 |
+-------------------------+----------+-----------------+--------------+------+------+-------------+


(func pid=2826) [1,  6000] loss: 0.624
(func pid=2876) [1,  2000] loss: 2.316
(func pid=2872) [1,  2000] loss: 2.304
(func pid=2862) [1,  2000] loss: 1.770
(func pid=2874) [1,  4000] loss: 1.007
(func pid=2865) [1,  4000] loss: 1.022
(func pid=2860) [1,  4000] loss: 1.164
(func pid=2870) [1,  4000] loss: 1.007
(func pid=2826) [1,  8000] loss: 0.436
(func pid=2864) [1,  4000] loss: 0.876
(func pid=2868) [1,  4000] loss: 1.158
== Status ==
Current time: 2024-03-02 04:43:54 (running for 00:00:20.05)
Memory usage on this node: 15.5/503.5 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 20.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (10 RUNNING)
+-------------------------+----------+-----------------+--------------+------+------+-------------+
| Trial name              | status   | loc             |   batch_size |   l1 |   l2 |          lr |
|-------------------------+----------+-----------------+--------------+------+------+-------------|
| train_cifar_6d290_00000 | RUNNING  | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 |
| train_cifar_6d290_00001 | RUNNING  | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   |
| train_cifar_6d290_00002 | RUNNING  | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  |
| train_cifar_6d290_00003 | RUNNING  | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  |
| train_cifar_6d290_00004 | RUNNING  | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 |
| train_cifar_6d290_00005 | RUNNING  | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    |
| train_cifar_6d290_00006 | RUNNING  | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  |
| train_cifar_6d290_00007 | RUNNING  | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 |
| train_cifar_6d290_00008 | RUNNING  | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 |
| train_cifar_6d290_00009 | RUNNING  | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 |
+-------------------------+----------+-----------------+--------------+------+------+-------------+


Result for train_cifar_6d290_00007:
  accuracy: 0.0994
  date: 2024-03-02_04-43-54
  done: false
  experiment_id: f19485fe2f634a00b015357467f36795
  hostname: d4ad428dd9ba
  iterations_since_restore: 1
  loss: 2.2910482563018797
  node_ip: 172.17.0.3
  pid: 2872
  should_checkpoint: true
  time_since_restore: 11.915996551513672
  time_this_iter_s: 11.915996551513672
  time_total_s: 11.915996551513672
  timestamp: 1709354634
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 6d290_00007
  warmup_time: 0.0019135475158691406

Result for train_cifar_6d290_00002:
  accuracy: 0.4641
  date: 2024-03-02_04-43-54
  done: false
  experiment_id: abdb7359c85347ca98f33663322aa215
  hostname: d4ad428dd9ba
  iterations_since_restore: 1
  loss: 1.4655762907028198
  node_ip: 172.17.0.3
  pid: 2862
  should_checkpoint: true
  time_since_restore: 12.074471712112427
  time_this_iter_s: 12.074471712112427
  time_total_s: 12.074471712112427
  timestamp: 1709354634
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 6d290_00002
  warmup_time: 0.001991748809814453

(func pid=2876) [1,  4000] loss: 1.153
(func pid=2874) [1,  6000] loss: 0.611
(func pid=2865) [1,  6000] loss: 0.642
(func pid=2860) [1,  6000] loss: 0.776
(func pid=2826) [1, 10000] loss: 0.338
(func pid=2870) [1,  6000] loss: 0.625
(func pid=2864) [1,  6000] loss: 0.542
(func pid=2868) [1,  6000] loss: 0.772
== Status ==
Current time: 2024-03-02 04:43:59 (running for 00:00:25.42)
Memory usage on this node: 15.5/503.5 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: -1.8783122735023499
Resources requested: 20.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (10 RUNNING)
+-------------------------+----------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name              | status   | loc             |   batch_size |   l1 |   l2 |          lr |    loss |   accuracy |   training_iteration |
|-------------------------+----------+-----------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_6d290_00000 | RUNNING  | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 |         |            |                      |
| train_cifar_6d290_00001 | RUNNING  | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   |         |            |                      |
| train_cifar_6d290_00002 | RUNNING  | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  | 1.46558 |     0.4641 |                    1 |
| train_cifar_6d290_00003 | RUNNING  | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  |         |            |                      |
| train_cifar_6d290_00004 | RUNNING  | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 |         |            |                      |
| train_cifar_6d290_00005 | RUNNING  | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    |         |            |                      |
| train_cifar_6d290_00006 | RUNNING  | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  |         |            |                      |
| train_cifar_6d290_00007 | RUNNING  | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 | 2.29105 |     0.0994 |                    1 |
| train_cifar_6d290_00008 | RUNNING  | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 |         |            |                      |
| train_cifar_6d290_00009 | RUNNING  | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 |         |            |                      |
+-------------------------+----------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+


Result for train_cifar_6d290_00009:
  accuracy: 0.151
  date: 2024-03-02_04-43-59
  done: true
  experiment_id: abeafbfc379643aba2f107b13d81dfea
  hostname: d4ad428dd9ba
  iterations_since_restore: 1
  loss: 2.2982413373947144
  node_ip: 172.17.0.3
  pid: 2876
  should_checkpoint: true
  time_since_restore: 17.242530584335327
  time_this_iter_s: 17.242530584335327
  time_total_s: 17.242530584335327
  timestamp: 1709354639
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 6d290_00009
  warmup_time: 0.002002716064453125

(func pid=2874) [1,  8000] loss: 0.431
(func pid=2865) [1,  8000] loss: 0.465
(func pid=2860) [1,  8000] loss: 0.582
(func pid=2826) [1, 12000] loss: 0.276
(func pid=2870) [1,  8000] loss: 0.447
(func pid=2872) [2,  2000] loss: 2.266
(func pid=2862) [2,  2000] loss: 1.423
(func pid=2864) [1,  8000] loss: 0.387
(func pid=2868) [1,  8000] loss: 0.580
(func pid=2874) [1, 10000] loss: 0.329
(func pid=2865) [1, 10000] loss: 0.358
(func pid=2860) [1, 10000] loss: 0.466
(func pid=2826) [1, 14000] loss: 0.230
Result for train_cifar_6d290_00002:
  accuracy: 0.5064
  date: 2024-03-02_04-44-04
  done: false
  experiment_id: abdb7359c85347ca98f33663322aa215
  hostname: d4ad428dd9ba
  iterations_since_restore: 2
  loss: 1.3893543909072876
  node_ip: 172.17.0.3
  pid: 2862
  should_checkpoint: true
  time_since_restore: 22.066697120666504
  time_this_iter_s: 9.992225408554077
  time_total_s: 22.066697120666504
  timestamp: 1709354644
  timesteps_since_restore: 0
  training_iteration: 2
  trial_id: 6d290_00002
  warmup_time: 0.001991748809814453

Result for train_cifar_6d290_00007:
  accuracy: 0.1765
  date: 2024-03-02_04-44-04
  done: true
  experiment_id: f19485fe2f634a00b015357467f36795
  hostname: d4ad428dd9ba
  iterations_since_restore: 2
  loss: 2.169746999549866
  node_ip: 172.17.0.3
  pid: 2872
  should_checkpoint: true
  time_since_restore: 22.03648281097412
  time_this_iter_s: 10.12048625946045
  time_total_s: 22.03648281097412
  timestamp: 1709354644
  timesteps_since_restore: 0
  training_iteration: 2
  trial_id: 6d290_00007
  warmup_time: 0.0019135475158691406

== Status ==
Current time: 2024-03-02 04:44:04 (running for 00:00:30.43)
Memory usage on this node: 14.8/503.5 GiB
Using AsyncHyperBand: num_stopped=2
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: -1.779550695228577 | Iter 1.000: -2.2910482563018797
Resources requested: 16.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (8 RUNNING, 2 TERMINATED)
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name              | status     | loc             |   batch_size |   l1 |   l2 |          lr |    loss |   accuracy |   training_iteration |
|-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_6d290_00000 | RUNNING    | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 |         |            |                      |
| train_cifar_6d290_00001 | RUNNING    | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   |         |            |                      |
| train_cifar_6d290_00002 | RUNNING    | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  | 1.38935 |     0.5064 |                    2 |
| train_cifar_6d290_00003 | RUNNING    | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  |         |            |                      |
| train_cifar_6d290_00004 | RUNNING    | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 |         |            |                      |
| train_cifar_6d290_00005 | RUNNING    | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    |         |            |                      |
| train_cifar_6d290_00006 | RUNNING    | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  |         |            |                      |
| train_cifar_6d290_00008 | RUNNING    | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 |         |            |                      |
| train_cifar_6d290_00007 | TERMINATED | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 | 2.16975 |     0.1765 |                    2 |
| train_cifar_6d290_00009 | TERMINATED | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 | 2.29824 |     0.151  |                    1 |
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+


(func pid=2870) [1, 10000] loss: 0.352
(func pid=2874) [1, 12000] loss: 0.269
(func pid=2864) [1, 10000] loss: 0.302
(func pid=2865) [1, 12000] loss: 0.289
(func pid=2868) [1, 10000] loss: 0.463
(func pid=2860) [1, 12000] loss: 0.387
(func pid=2826) [1, 16000] loss: 0.203
(func pid=2870) [1, 12000] loss: 0.284
== Status ==
Current time: 2024-03-02 04:44:09 (running for 00:00:35.44)
Memory usage on this node: 14.2/503.5 GiB
Using AsyncHyperBand: num_stopped=2
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: -1.779550695228577 | Iter 1.000: -2.2910482563018797
Resources requested: 16.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (8 RUNNING, 2 TERMINATED)
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name              | status     | loc             |   batch_size |   l1 |   l2 |          lr |    loss |   accuracy |   training_iteration |
|-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_6d290_00000 | RUNNING    | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 |         |            |                      |
| train_cifar_6d290_00001 | RUNNING    | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   |         |            |                      |
| train_cifar_6d290_00002 | RUNNING    | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  | 1.38935 |     0.5064 |                    2 |
| train_cifar_6d290_00003 | RUNNING    | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  |         |            |                      |
| train_cifar_6d290_00004 | RUNNING    | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 |         |            |                      |
| train_cifar_6d290_00005 | RUNNING    | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    |         |            |                      |
| train_cifar_6d290_00006 | RUNNING    | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  |         |            |                      |
| train_cifar_6d290_00008 | RUNNING    | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 |         |            |                      |
| train_cifar_6d290_00007 | TERMINATED | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 | 2.16975 |     0.1765 |                    2 |
| train_cifar_6d290_00009 | TERMINATED | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 | 2.29824 |     0.151  |                    1 |
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+


Result for train_cifar_6d290_00003:
  accuracy: 0.447
  date: 2024-03-02_04-44-09
  done: false
  experiment_id: 526e99801f96475ab85987a482ea2c11
  hostname: d4ad428dd9ba
  iterations_since_restore: 1
  loss: 1.5263650244176388
  node_ip: 172.17.0.3
  pid: 2864
  should_checkpoint: true
  time_since_restore: 27.371662378311157
  time_this_iter_s: 27.371662378311157
  time_total_s: 27.371662378311157
  timestamp: 1709354649
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 6d290_00003
  warmup_time: 0.0019271373748779297

(func pid=2874) [1, 14000] loss: 0.222
(func pid=2865) [1, 14000] loss: 0.237
Result for train_cifar_6d290_00005:
  accuracy: 0.094
  date: 2024-03-02_04-44-10
  done: true
  experiment_id: 5059e143519d4c859627d86b4490084a
  hostname: d4ad428dd9ba
  iterations_since_restore: 1
  loss: 2.314511362361908
  node_ip: 172.17.0.3
  pid: 2868
  should_checkpoint: true
  time_since_restore: 28.12521982192993
  time_this_iter_s: 28.12521982192993
  time_total_s: 28.12521982192993
  timestamp: 1709354650
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 6d290_00005
  warmup_time: 0.0015180110931396484

(func pid=2862) [3,  2000] loss: 1.307
(func pid=2826) [1, 18000] loss: 0.176
(func pid=2860) [1, 14000] loss: 0.333
(func pid=2870) [1, 14000] loss: 0.241
(func pid=2874) [1, 16000] loss: 0.194
(func pid=2865) [1, 16000] loss: 0.203
Result for train_cifar_6d290_00002:
  accuracy: 0.5291
  date: 2024-03-02_04-44-13
  done: false
  experiment_id: abdb7359c85347ca98f33663322aa215
  hostname: d4ad428dd9ba
  iterations_since_restore: 3
  loss: 1.337629676055908
  node_ip: 172.17.0.3
  pid: 2862
  should_checkpoint: true
  time_since_restore: 31.39844059944153
  time_this_iter_s: 9.331743478775024
  time_total_s: 31.39844059944153
  timestamp: 1709354653
  timesteps_since_restore: 0
  training_iteration: 3
  trial_id: 6d290_00002
  warmup_time: 0.001991748809814453

(func pid=2864) [2,  2000] loss: 1.455
(func pid=2826) [1, 20000] loss: 0.154
(func pid=2860) [1, 16000] loss: 0.291
(func pid=2870) [1, 16000] loss: 0.204
(func pid=2874) [1, 18000] loss: 0.173
(func pid=2865) [1, 18000] loss: 0.177
(func pid=2864) [2,  4000] loss: 0.718
(func pid=2860) [1, 18000] loss: 0.259
== Status ==
Current time: 2024-03-02 04:44:18 (running for 00:00:44.74)
Memory usage on this node: 13.5/503.5 GiB
Using AsyncHyperBand: num_stopped=3
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: -1.779550695228577 | Iter 1.000: -2.2910482563018797
Resources requested: 14.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (7 RUNNING, 3 TERMINATED)
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name              | status     | loc             |   batch_size |   l1 |   l2 |          lr |    loss |   accuracy |   training_iteration |
|-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_6d290_00000 | RUNNING    | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 |         |            |                      |
| train_cifar_6d290_00001 | RUNNING    | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   |         |            |                      |
| train_cifar_6d290_00002 | RUNNING    | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  | 1.33763 |     0.5291 |                    3 |
| train_cifar_6d290_00003 | RUNNING    | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  | 1.52637 |     0.447  |                    1 |
| train_cifar_6d290_00004 | RUNNING    | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 |         |            |                      |
| train_cifar_6d290_00006 | RUNNING    | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  |         |            |                      |
| train_cifar_6d290_00008 | RUNNING    | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 |         |            |                      |
| train_cifar_6d290_00005 | TERMINATED | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    | 2.31451 |     0.094  |                    1 |
| train_cifar_6d290_00007 | TERMINATED | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 | 2.16975 |     0.1765 |                    2 |
| train_cifar_6d290_00009 | TERMINATED | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 | 2.29824 |     0.151  |                    1 |
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+


(func pid=2870) [1, 18000] loss: 0.181
Result for train_cifar_6d290_00000:
  accuracy: 0.4459
  date: 2024-03-02_04-44-19
  done: false
  experiment_id: 78b98e83022a4e3ab2f94e42b7448240
  hostname: d4ad428dd9ba
  iterations_since_restore: 1
  loss: 1.4989160758927464
  node_ip: 172.17.0.3
  pid: 2826
  should_checkpoint: true
  time_since_restore: 41.62524342536926
  time_this_iter_s: 41.62524342536926
  time_total_s: 41.62524342536926
  timestamp: 1709354659
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 6d290_00000
  warmup_time: 0.002897977828979492

(func pid=2874) [1, 20000] loss: 0.151
(func pid=2862) [4,  2000] loss: 1.248
(func pid=2865) [1, 20000] loss: 0.158
(func pid=2864) [2,  6000] loss: 0.469
(func pid=2860) [1, 20000] loss: 0.233
(func pid=2870) [1, 20000] loss: 0.163
Result for train_cifar_6d290_00002:
  accuracy: 0.5388
  date: 2024-03-02_04-44-23
  done: false
  experiment_id: abdb7359c85347ca98f33663322aa215
  hostname: d4ad428dd9ba
  iterations_since_restore: 4
  loss: 1.3011067903518676
  node_ip: 172.17.0.3
  pid: 2862
  should_checkpoint: true
  time_since_restore: 40.73930835723877
  time_this_iter_s: 9.340867757797241
  time_total_s: 40.73930835723877
  timestamp: 1709354663
  timesteps_since_restore: 0
  training_iteration: 4
  trial_id: 6d290_00002
  warmup_time: 0.001991748809814453

(func pid=2826) [2,  2000] loss: 1.539
(func pid=2864) [2,  8000] loss: 0.351
Result for train_cifar_6d290_00008:
  accuracy: 0.4683
  date: 2024-03-02_04-44-25
  done: false
  experiment_id: a0a581b251fb48c7b1c0c0007bbaba1d
  hostname: d4ad428dd9ba
  iterations_since_restore: 1
  loss: 1.4401714820161462
  node_ip: 172.17.0.3
  pid: 2874
  should_checkpoint: true
  time_since_restore: 42.98476052284241
  time_this_iter_s: 42.98476052284241
  time_total_s: 42.98476052284241
  timestamp: 1709354665
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 6d290_00008
  warmup_time: 0.002081632614135742

== Status ==
Current time: 2024-03-02 04:44:25 (running for 00:00:51.31)
Memory usage on this node: 13.4/503.5 GiB
Using AsyncHyperBand: num_stopped=3
Bracket: Iter 8.000: None | Iter 4.000: -1.3011067903518676 | Iter 2.000: -1.779550695228577 | Iter 1.000: -1.5263650244176388
Resources requested: 14.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (7 RUNNING, 3 TERMINATED)
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name              | status     | loc             |   batch_size |   l1 |   l2 |          lr |    loss |   accuracy |   training_iteration |
|-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_6d290_00000 | RUNNING    | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 | 1.49892 |     0.4459 |                    1 |
| train_cifar_6d290_00001 | RUNNING    | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   |         |            |                      |
| train_cifar_6d290_00002 | RUNNING    | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  | 1.30111 |     0.5388 |                    4 |
| train_cifar_6d290_00003 | RUNNING    | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  | 1.52637 |     0.447  |                    1 |
| train_cifar_6d290_00004 | RUNNING    | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 |         |            |                      |
| train_cifar_6d290_00006 | RUNNING    | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  |         |            |                      |
| train_cifar_6d290_00008 | RUNNING    | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 | 1.44017 |     0.4683 |                    1 |
| train_cifar_6d290_00005 | TERMINATED | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    | 2.31451 |     0.094  |                    1 |
| train_cifar_6d290_00007 | TERMINATED | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 | 2.16975 |     0.1765 |                    2 |
| train_cifar_6d290_00009 | TERMINATED | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 | 2.29824 |     0.151  |                    1 |
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+


Result for train_cifar_6d290_00004:
  accuracy: 0.4175
  date: 2024-03-02_04-44-25
  done: true
  experiment_id: 3cd2a7ade2704e4b80d3c2aed32ca500
  hostname: d4ad428dd9ba
  iterations_since_restore: 1
  loss: 1.5434601649925113
  node_ip: 172.17.0.3
  pid: 2865
  should_checkpoint: true
  time_since_restore: 43.28671646118164
  time_this_iter_s: 43.28671646118164
  time_total_s: 43.28671646118164
  timestamp: 1709354665
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 6d290_00004
  warmup_time: 0.0018148422241210938

(func pid=2826) [2,  4000] loss: 0.766
Result for train_cifar_6d290_00001:
  accuracy: 0.0977
  date: 2024-03-02_04-44-27
  done: true
  experiment_id: 6e5411bf9e8d49db9dc507ad2785cea1
  hostname: d4ad428dd9ba
  iterations_since_restore: 1
  loss: 2.3108368701934814
  node_ip: 172.17.0.3
  pid: 2860
  should_checkpoint: true
  time_since_restore: 44.807891607284546
  time_this_iter_s: 44.807891607284546
  time_total_s: 44.807891607284546
  timestamp: 1709354667
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 6d290_00001
  warmup_time: 0.0018398761749267578

Result for train_cifar_6d290_00006:
  accuracy: 0.4197
  date: 2024-03-02_04-44-28
  done: true
  experiment_id: 5f2a2d03d9814a59a271e8b791d66271
  hostname: d4ad428dd9ba
  iterations_since_restore: 1
  loss: 1.6034393922151997
  node_ip: 172.17.0.3
  pid: 2870
  should_checkpoint: true
  time_since_restore: 45.63483643531799
  time_this_iter_s: 45.63483643531799
  time_total_s: 45.63483643531799
  timestamp: 1709354668
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 6d290_00006
  warmup_time: 0.0017707347869873047

(func pid=2874) [2,  2000] loss: 1.481
(func pid=2864) [2, 10000] loss: 0.277
(func pid=2862) [5,  2000] loss: 1.211
(func pid=2826) [2,  6000] loss: 0.506
Result for train_cifar_6d290_00003:
  accuracy: 0.5222
  date: 2024-03-02_04-44-31
  done: false
  experiment_id: 526e99801f96475ab85987a482ea2c11
  hostname: d4ad428dd9ba
  iterations_since_restore: 2
  loss: 1.350230276836455
  node_ip: 172.17.0.3
  pid: 2864
  should_checkpoint: true
  time_since_restore: 48.89120292663574
  time_this_iter_s: 21.519540548324585
  time_total_s: 48.89120292663574
  timestamp: 1709354671
  timesteps_since_restore: 0
  training_iteration: 2
  trial_id: 6d290_00003
  warmup_time: 0.0019271373748779297

== Status ==
Current time: 2024-03-02 04:44:31 (running for 00:00:57.31)
Memory usage on this node: 11.5/503.5 GiB
Using AsyncHyperBand: num_stopped=6
Bracket: Iter 8.000: None | Iter 4.000: -1.3011067903518676 | Iter 2.000: -1.3893543909072876 | Iter 1.000: -1.5734497786038555
Resources requested: 8.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (4 RUNNING, 6 TERMINATED)
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name              | status     | loc             |   batch_size |   l1 |   l2 |          lr |    loss |   accuracy |   training_iteration |
|-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_6d290_00000 | RUNNING    | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 | 1.49892 |     0.4459 |                    1 |
| train_cifar_6d290_00002 | RUNNING    | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  | 1.30111 |     0.5388 |                    4 |
| train_cifar_6d290_00003 | RUNNING    | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  | 1.35023 |     0.5222 |                    2 |
| train_cifar_6d290_00008 | RUNNING    | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 | 1.44017 |     0.4683 |                    1 |
| train_cifar_6d290_00001 | TERMINATED | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   | 2.31084 |     0.0977 |                    1 |
| train_cifar_6d290_00004 | TERMINATED | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 | 1.54346 |     0.4175 |                    1 |
| train_cifar_6d290_00005 | TERMINATED | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    | 2.31451 |     0.094  |                    1 |
| train_cifar_6d290_00006 | TERMINATED | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  | 1.60344 |     0.4197 |                    1 |
| train_cifar_6d290_00007 | TERMINATED | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 | 2.16975 |     0.1765 |                    2 |
| train_cifar_6d290_00009 | TERMINATED | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 | 2.29824 |     0.151  |                    1 |
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+


Result for train_cifar_6d290_00002:
  accuracy: 0.5661
  date: 2024-03-02_04-44-31
  done: false
  experiment_id: abdb7359c85347ca98f33663322aa215
  hostname: d4ad428dd9ba
  iterations_since_restore: 5
  loss: 1.2768983849048614
  node_ip: 172.17.0.3
  pid: 2862
  should_checkpoint: true
  time_since_restore: 49.19853448867798
  time_this_iter_s: 8.459226131439209
  time_total_s: 49.19853448867798
  timestamp: 1709354671
  timesteps_since_restore: 0
  training_iteration: 5
  trial_id: 6d290_00002
  warmup_time: 0.001991748809814453

(func pid=2874) [2,  4000] loss: 0.739
(func pid=2826) [2,  8000] loss: 0.377
(func pid=2874) [2,  6000] loss: 0.486
(func pid=2864) [3,  2000] loss: 1.326
(func pid=2826) [2, 10000] loss: 0.303
== Status ==
Current time: 2024-03-02 04:44:36 (running for 00:01:02.54)
Memory usage on this node: 11.5/503.5 GiB
Using AsyncHyperBand: num_stopped=6
Bracket: Iter 8.000: None | Iter 4.000: -1.3011067903518676 | Iter 2.000: -1.3893543909072876 | Iter 1.000: -1.5734497786038555
Resources requested: 8.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (4 RUNNING, 6 TERMINATED)
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name              | status     | loc             |   batch_size |   l1 |   l2 |          lr |    loss |   accuracy |   training_iteration |
|-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_6d290_00000 | RUNNING    | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 | 1.49892 |     0.4459 |                    1 |
| train_cifar_6d290_00002 | RUNNING    | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  | 1.2769  |     0.5661 |                    5 |
| train_cifar_6d290_00003 | RUNNING    | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  | 1.35023 |     0.5222 |                    2 |
| train_cifar_6d290_00008 | RUNNING    | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 | 1.44017 |     0.4683 |                    1 |
| train_cifar_6d290_00001 | TERMINATED | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   | 2.31084 |     0.0977 |                    1 |
| train_cifar_6d290_00004 | TERMINATED | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 | 1.54346 |     0.4175 |                    1 |
| train_cifar_6d290_00005 | TERMINATED | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    | 2.31451 |     0.094  |                    1 |
| train_cifar_6d290_00006 | TERMINATED | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  | 1.60344 |     0.4197 |                    1 |
| train_cifar_6d290_00007 | TERMINATED | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 | 2.16975 |     0.1765 |                    2 |
| train_cifar_6d290_00009 | TERMINATED | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 | 2.29824 |     0.151  |                    1 |
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+


(func pid=2862) [6,  2000] loss: 1.176
(func pid=2874) [2,  8000] loss: 0.353
(func pid=2826) [2, 12000] loss: 0.241
(func pid=2864) [3,  4000] loss: 0.665
Result for train_cifar_6d290_00002:
  accuracy: 0.5583
  date: 2024-03-02_04-44-39
  done: false
  experiment_id: abdb7359c85347ca98f33663322aa215
  hostname: d4ad428dd9ba
  iterations_since_restore: 6
  loss: 1.2511892985343933
  node_ip: 172.17.0.3
  pid: 2862
  should_checkpoint: true
  time_since_restore: 57.02598834037781
  time_this_iter_s: 7.827453851699829
  time_total_s: 57.02598834037781
  timestamp: 1709354679
  timesteps_since_restore: 0
  training_iteration: 6
  trial_id: 6d290_00002
  warmup_time: 0.001991748809814453

(func pid=2874) [2, 10000] loss: 0.283
(func pid=2826) [2, 14000] loss: 0.211
(func pid=2864) [3,  6000] loss: 0.443
(func pid=2874) [2, 12000] loss: 0.240
(func pid=2826) [2, 16000] loss: 0.181
== Status ==
Current time: 2024-03-02 04:44:44 (running for 00:01:10.38)
Memory usage on this node: 11.5/503.5 GiB
Using AsyncHyperBand: num_stopped=6
Bracket: Iter 8.000: None | Iter 4.000: -1.3011067903518676 | Iter 2.000: -1.3893543909072876 | Iter 1.000: -1.5734497786038555
Resources requested: 8.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (4 RUNNING, 6 TERMINATED)
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name              | status     | loc             |   batch_size |   l1 |   l2 |          lr |    loss |   accuracy |   training_iteration |
|-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_6d290_00000 | RUNNING    | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 | 1.49892 |     0.4459 |                    1 |
| train_cifar_6d290_00002 | RUNNING    | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  | 1.25119 |     0.5583 |                    6 |
| train_cifar_6d290_00003 | RUNNING    | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  | 1.35023 |     0.5222 |                    2 |
| train_cifar_6d290_00008 | RUNNING    | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 | 1.44017 |     0.4683 |                    1 |
| train_cifar_6d290_00001 | TERMINATED | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   | 2.31084 |     0.0977 |                    1 |
| train_cifar_6d290_00004 | TERMINATED | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 | 1.54346 |     0.4175 |                    1 |
| train_cifar_6d290_00005 | TERMINATED | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    | 2.31451 |     0.094  |                    1 |
| train_cifar_6d290_00006 | TERMINATED | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  | 1.60344 |     0.4197 |                    1 |
| train_cifar_6d290_00007 | TERMINATED | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 | 2.16975 |     0.1765 |                    2 |
| train_cifar_6d290_00009 | TERMINATED | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 | 2.29824 |     0.151  |                    1 |
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+


(func pid=2862) [7,  2000] loss: 1.150
(func pid=2864) [3,  8000] loss: 0.337
(func pid=2874) [2, 14000] loss: 0.202
(func pid=2826) [2, 18000] loss: 0.163
Result for train_cifar_6d290_00002:
  accuracy: 0.5316
  date: 2024-03-02_04-44-47
  done: false
  experiment_id: abdb7359c85347ca98f33663322aa215
  hostname: d4ad428dd9ba
  iterations_since_restore: 7
  loss: 1.3719351972579956
  node_ip: 172.17.0.3
  pid: 2862
  should_checkpoint: true
  time_since_restore: 64.638662815094
  time_this_iter_s: 7.6126744747161865
  time_total_s: 64.638662815094
  timestamp: 1709354687
  timesteps_since_restore: 0
  training_iteration: 7
  trial_id: 6d290_00002
  warmup_time: 0.001991748809814453

(func pid=2864) [3, 10000] loss: 0.265
(func pid=2874) [2, 16000] loss: 0.176
(func pid=2826) [2, 20000] loss: 0.145
Result for train_cifar_6d290_00003:
  accuracy: 0.5269
  date: 2024-03-02_04-44-50
  done: false
  experiment_id: 526e99801f96475ab85987a482ea2c11
  hostname: d4ad428dd9ba
  iterations_since_restore: 3
  loss: 1.3348135735213758
  node_ip: 172.17.0.3
  pid: 2864
  should_checkpoint: true
  time_since_restore: 68.16448998451233
  time_this_iter_s: 19.273287057876587
  time_total_s: 68.16448998451233
  timestamp: 1709354690
  timesteps_since_restore: 0
  training_iteration: 3
  trial_id: 6d290_00003
  warmup_time: 0.0019271373748779297

== Status ==
Current time: 2024-03-02 04:44:50 (running for 00:01:16.57)
Memory usage on this node: 11.5/503.5 GiB
Using AsyncHyperBand: num_stopped=6
Bracket: Iter 8.000: None | Iter 4.000: -1.3011067903518676 | Iter 2.000: -1.3893543909072876 | Iter 1.000: -1.5734497786038555
Resources requested: 8.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (4 RUNNING, 6 TERMINATED)
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name              | status     | loc             |   batch_size |   l1 |   l2 |          lr |    loss |   accuracy |   training_iteration |
|-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_6d290_00000 | RUNNING    | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 | 1.49892 |     0.4459 |                    1 |
| train_cifar_6d290_00002 | RUNNING    | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  | 1.37194 |     0.5316 |                    7 |
| train_cifar_6d290_00003 | RUNNING    | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  | 1.33481 |     0.5269 |                    3 |
| train_cifar_6d290_00008 | RUNNING    | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 | 1.44017 |     0.4683 |                    1 |
| train_cifar_6d290_00001 | TERMINATED | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   | 2.31084 |     0.0977 |                    1 |
| train_cifar_6d290_00004 | TERMINATED | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 | 1.54346 |     0.4175 |                    1 |
| train_cifar_6d290_00005 | TERMINATED | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    | 2.31451 |     0.094  |                    1 |
| train_cifar_6d290_00006 | TERMINATED | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  | 1.60344 |     0.4197 |                    1 |
| train_cifar_6d290_00007 | TERMINATED | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 | 2.16975 |     0.1765 |                    2 |
| train_cifar_6d290_00009 | TERMINATED | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 | 2.29824 |     0.151  |                    1 |
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+


(func pid=2874) [2, 18000] loss: 0.155
(func pid=2862) [8,  2000] loss: 1.127
Result for train_cifar_6d290_00000:
  accuracy: 0.4412
  date: 2024-03-02_04-44-53
  done: true
  experiment_id: 78b98e83022a4e3ab2f94e42b7448240
  hostname: d4ad428dd9ba
  iterations_since_restore: 2
  loss: 1.5479773554574698
  node_ip: 172.17.0.3
  pid: 2826
  should_checkpoint: true
  time_since_restore: 75.81422781944275
  time_this_iter_s: 34.188984394073486
  time_total_s: 75.81422781944275
  timestamp: 1709354693
  timesteps_since_restore: 0
  training_iteration: 2
  trial_id: 6d290_00000
  warmup_time: 0.002897977828979492

(func pid=2874) [2, 20000] loss: 0.140
(func pid=2864) [4,  2000] loss: 1.267
Result for train_cifar_6d290_00002:
  accuracy: 0.5409
  date: 2024-03-02_04-44-54
  done: false
  experiment_id: abdb7359c85347ca98f33663322aa215
  hostname: d4ad428dd9ba
  iterations_since_restore: 8
  loss: 1.316029909992218
  node_ip: 172.17.0.3
  pid: 2862
  should_checkpoint: true
  time_since_restore: 72.55095863342285
  time_this_iter_s: 7.912295818328857
  time_total_s: 72.55095863342285
  timestamp: 1709354694
  timesteps_since_restore: 0
  training_iteration: 8
  trial_id: 6d290_00002
  warmup_time: 0.001991748809814453

(func pid=2864) [4,  4000] loss: 0.638
Result for train_cifar_6d290_00008:
  accuracy: 0.5256
  date: 2024-03-02_04-44-58
  done: false
  experiment_id: a0a581b251fb48c7b1c0c0007bbaba1d
  hostname: d4ad428dd9ba
  iterations_since_restore: 2
  loss: 1.3147970693634823
  node_ip: 172.17.0.3
  pid: 2874
  should_checkpoint: true
  time_since_restore: 76.2469220161438
  time_this_iter_s: 33.26216149330139
  time_total_s: 76.2469220161438
  timestamp: 1709354698
  timesteps_since_restore: 0
  training_iteration: 2
  trial_id: 6d290_00008
  warmup_time: 0.002081632614135742

== Status ==
Current time: 2024-03-02 04:44:58 (running for 00:01:24.58)
Memory usage on this node: 10.9/503.5 GiB
Using AsyncHyperBand: num_stopped=7
Bracket: Iter 8.000: -1.316029909992218 | Iter 4.000: -1.3011067903518676 | Iter 2.000: -1.3893543909072876 | Iter 1.000: -1.5734497786038555
Resources requested: 6.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (3 RUNNING, 7 TERMINATED)
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name              | status     | loc             |   batch_size |   l1 |   l2 |          lr |    loss |   accuracy |   training_iteration |
|-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_6d290_00002 | RUNNING    | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  | 1.31603 |     0.5409 |                    8 |
| train_cifar_6d290_00003 | RUNNING    | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  | 1.33481 |     0.5269 |                    3 |
| train_cifar_6d290_00008 | RUNNING    | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 | 1.3148  |     0.5256 |                    2 |
| train_cifar_6d290_00000 | TERMINATED | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 | 1.54798 |     0.4412 |                    2 |
| train_cifar_6d290_00001 | TERMINATED | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   | 2.31084 |     0.0977 |                    1 |
| train_cifar_6d290_00004 | TERMINATED | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 | 1.54346 |     0.4175 |                    1 |
| train_cifar_6d290_00005 | TERMINATED | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    | 2.31451 |     0.094  |                    1 |
| train_cifar_6d290_00006 | TERMINATED | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  | 1.60344 |     0.4197 |                    1 |
| train_cifar_6d290_00007 | TERMINATED | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 | 2.16975 |     0.1765 |                    2 |
| train_cifar_6d290_00009 | TERMINATED | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 | 2.29824 |     0.151  |                    1 |
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+


(func pid=2862) [9,  2000] loss: 1.109
(func pid=2864) [4,  6000] loss: 0.431
(func pid=2874) [3,  2000] loss: 1.347
Result for train_cifar_6d290_00002:
  accuracy: 0.578
  date: 2024-03-02_04-45-02
  done: false
  experiment_id: abdb7359c85347ca98f33663322aa215
  hostname: d4ad428dd9ba
  iterations_since_restore: 9
  loss: 1.26980535364151
  node_ip: 172.17.0.3
  pid: 2862
  should_checkpoint: true
  time_since_restore: 79.8222668170929
  time_this_iter_s: 7.271308183670044
  time_total_s: 79.8222668170929
  timestamp: 1709354702
  timesteps_since_restore: 0
  training_iteration: 9
  trial_id: 6d290_00002
  warmup_time: 0.001991748809814453

(func pid=2864) [4,  8000] loss: 0.325
(func pid=2874) [3,  4000] loss: 0.678
(func pid=2864) [4, 10000] loss: 0.262
(func pid=2874) [3,  6000] loss: 0.450
== Status ==
Current time: 2024-03-02 04:45:07 (running for 00:01:33.17)
Memory usage on this node: 10.9/503.5 GiB
Using AsyncHyperBand: num_stopped=7
Bracket: Iter 8.000: -1.316029909992218 | Iter 4.000: -1.3011067903518676 | Iter 2.000: -1.3893543909072876 | Iter 1.000: -1.5734497786038555
Resources requested: 6.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (3 RUNNING, 7 TERMINATED)
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name              | status     | loc             |   batch_size |   l1 |   l2 |          lr |    loss |   accuracy |   training_iteration |
|-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_6d290_00002 | RUNNING    | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  | 1.26981 |     0.578  |                    9 |
| train_cifar_6d290_00003 | RUNNING    | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  | 1.33481 |     0.5269 |                    3 |
| train_cifar_6d290_00008 | RUNNING    | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 | 1.3148  |     0.5256 |                    2 |
| train_cifar_6d290_00000 | TERMINATED | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 | 1.54798 |     0.4412 |                    2 |
| train_cifar_6d290_00001 | TERMINATED | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   | 2.31084 |     0.0977 |                    1 |
| train_cifar_6d290_00004 | TERMINATED | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 | 1.54346 |     0.4175 |                    1 |
| train_cifar_6d290_00005 | TERMINATED | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    | 2.31451 |     0.094  |                    1 |
| train_cifar_6d290_00006 | TERMINATED | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  | 1.60344 |     0.4197 |                    1 |
| train_cifar_6d290_00007 | TERMINATED | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 | 2.16975 |     0.1765 |                    2 |
| train_cifar_6d290_00009 | TERMINATED | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 | 2.29824 |     0.151  |                    1 |
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+


(func pid=2862) [10,  2000] loss: 1.096
Result for train_cifar_6d290_00003:
  accuracy: 0.535
  date: 2024-03-02_04-45-09
  done: true
  experiment_id: 526e99801f96475ab85987a482ea2c11
  hostname: d4ad428dd9ba
  iterations_since_restore: 4
  loss: 1.3346445262908935
  node_ip: 172.17.0.3
  pid: 2864
  should_checkpoint: true
  time_since_restore: 86.90077805519104
  time_this_iter_s: 18.73628807067871
  time_total_s: 86.90077805519104
  timestamp: 1709354709
  timesteps_since_restore: 0
  training_iteration: 4
  trial_id: 6d290_00003
  warmup_time: 0.0019271373748779297

Result for train_cifar_6d290_00002:
  accuracy: 0.5812
  date: 2024-03-02_04-45-09
  done: true
  experiment_id: abdb7359c85347ca98f33663322aa215
  hostname: d4ad428dd9ba
  iterations_since_restore: 10
  loss: 1.2347397193431855
  node_ip: 172.17.0.3
  pid: 2862
  should_checkpoint: true
  time_since_restore: 87.27773475646973
  time_this_iter_s: 7.455467939376831
  time_total_s: 87.27773475646973
  timestamp: 1709354709
  timesteps_since_restore: 0
  training_iteration: 10
  trial_id: 6d290_00002
  warmup_time: 0.001991748809814453

(func pid=2874) [3,  8000] loss: 0.347
(func pid=2874) [3, 10000] loss: 0.270
== Status ==
Current time: 2024-03-02 04:45:14 (running for 00:01:40.62)
Memory usage on this node: 9.6/503.5 GiB
Using AsyncHyperBand: num_stopped=9
Bracket: Iter 8.000: -1.316029909992218 | Iter 4.000: -1.3178756583213804 | Iter 2.000: -1.3893543909072876 | Iter 1.000: -1.5734497786038555
Resources requested: 2.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (1 RUNNING, 9 TERMINATED)
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name              | status     | loc             |   batch_size |   l1 |   l2 |          lr |    loss |   accuracy |   training_iteration |
|-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_6d290_00008 | RUNNING    | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 | 1.3148  |     0.5256 |                    2 |
| train_cifar_6d290_00000 | TERMINATED | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 | 1.54798 |     0.4412 |                    2 |
| train_cifar_6d290_00001 | TERMINATED | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   | 2.31084 |     0.0977 |                    1 |
| train_cifar_6d290_00002 | TERMINATED | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  | 1.23474 |     0.5812 |                   10 |
| train_cifar_6d290_00003 | TERMINATED | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  | 1.33464 |     0.535  |                    4 |
| train_cifar_6d290_00004 | TERMINATED | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 | 1.54346 |     0.4175 |                    1 |
| train_cifar_6d290_00005 | TERMINATED | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    | 2.31451 |     0.094  |                    1 |
| train_cifar_6d290_00006 | TERMINATED | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  | 1.60344 |     0.4197 |                    1 |
| train_cifar_6d290_00007 | TERMINATED | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 | 2.16975 |     0.1765 |                    2 |
| train_cifar_6d290_00009 | TERMINATED | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 | 2.29824 |     0.151  |                    1 |
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+


(func pid=2874) [3, 12000] loss: 0.223
(func pid=2874) [3, 14000] loss: 0.193
== Status ==
Current time: 2024-03-02 04:45:19 (running for 00:01:45.63)
Memory usage on this node: 9.6/503.5 GiB
Using AsyncHyperBand: num_stopped=9
Bracket: Iter 8.000: -1.316029909992218 | Iter 4.000: -1.3178756583213804 | Iter 2.000: -1.3893543909072876 | Iter 1.000: -1.5734497786038555
Resources requested: 2.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (1 RUNNING, 9 TERMINATED)
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name              | status     | loc             |   batch_size |   l1 |   l2 |          lr |    loss |   accuracy |   training_iteration |
|-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_6d290_00008 | RUNNING    | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 | 1.3148  |     0.5256 |                    2 |
| train_cifar_6d290_00000 | TERMINATED | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 | 1.54798 |     0.4412 |                    2 |
| train_cifar_6d290_00001 | TERMINATED | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   | 2.31084 |     0.0977 |                    1 |
| train_cifar_6d290_00002 | TERMINATED | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  | 1.23474 |     0.5812 |                   10 |
| train_cifar_6d290_00003 | TERMINATED | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  | 1.33464 |     0.535  |                    4 |
| train_cifar_6d290_00004 | TERMINATED | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 | 1.54346 |     0.4175 |                    1 |
| train_cifar_6d290_00005 | TERMINATED | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    | 2.31451 |     0.094  |                    1 |
| train_cifar_6d290_00006 | TERMINATED | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  | 1.60344 |     0.4197 |                    1 |
| train_cifar_6d290_00007 | TERMINATED | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 | 2.16975 |     0.1765 |                    2 |
| train_cifar_6d290_00009 | TERMINATED | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 | 2.29824 |     0.151  |                    1 |
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+


(func pid=2874) [3, 16000] loss: 0.170
(func pid=2874) [3, 18000] loss: 0.150
== Status ==
Current time: 2024-03-02 04:45:24 (running for 00:01:50.64)
Memory usage on this node: 9.6/503.5 GiB
Using AsyncHyperBand: num_stopped=9
Bracket: Iter 8.000: -1.316029909992218 | Iter 4.000: -1.3178756583213804 | Iter 2.000: -1.3893543909072876 | Iter 1.000: -1.5734497786038555
Resources requested: 2.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (1 RUNNING, 9 TERMINATED)
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name              | status     | loc             |   batch_size |   l1 |   l2 |          lr |    loss |   accuracy |   training_iteration |
|-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_6d290_00008 | RUNNING    | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 | 1.3148  |     0.5256 |                    2 |
| train_cifar_6d290_00000 | TERMINATED | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 | 1.54798 |     0.4412 |                    2 |
| train_cifar_6d290_00001 | TERMINATED | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   | 2.31084 |     0.0977 |                    1 |
| train_cifar_6d290_00002 | TERMINATED | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  | 1.23474 |     0.5812 |                   10 |
| train_cifar_6d290_00003 | TERMINATED | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  | 1.33464 |     0.535  |                    4 |
| train_cifar_6d290_00004 | TERMINATED | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 | 1.54346 |     0.4175 |                    1 |
| train_cifar_6d290_00005 | TERMINATED | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    | 2.31451 |     0.094  |                    1 |
| train_cifar_6d290_00006 | TERMINATED | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  | 1.60344 |     0.4197 |                    1 |
| train_cifar_6d290_00007 | TERMINATED | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 | 2.16975 |     0.1765 |                    2 |
| train_cifar_6d290_00009 | TERMINATED | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 | 2.29824 |     0.151  |                    1 |
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+


(func pid=2874) [3, 20000] loss: 0.129
Result for train_cifar_6d290_00008:
  accuracy: 0.5084
  date: 2024-03-02_04-45-29
  done: false
  experiment_id: a0a581b251fb48c7b1c0c0007bbaba1d
  hostname: d4ad428dd9ba
  iterations_since_restore: 3
  loss: 1.3627182557277382
  node_ip: 172.17.0.3
  pid: 2874
  should_checkpoint: true
  time_since_restore: 106.62963962554932
  time_this_iter_s: 30.382717609405518
  time_total_s: 106.62963962554932
  timestamp: 1709354729
  timesteps_since_restore: 0
  training_iteration: 3
  trial_id: 6d290_00008
  warmup_time: 0.002081632614135742

(func pid=2874) [4,  2000] loss: 1.302
== Status ==
Current time: 2024-03-02 04:45:34 (running for 00:01:59.96)
Memory usage on this node: 9.6/503.5 GiB
Using AsyncHyperBand: num_stopped=9
Bracket: Iter 8.000: -1.316029909992218 | Iter 4.000: -1.3178756583213804 | Iter 2.000: -1.3893543909072876 | Iter 1.000: -1.5734497786038555
Resources requested: 2.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (1 RUNNING, 9 TERMINATED)
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name              | status     | loc             |   batch_size |   l1 |   l2 |          lr |    loss |   accuracy |   training_iteration |
|-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_6d290_00008 | RUNNING    | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 | 1.36272 |     0.5084 |                    3 |
| train_cifar_6d290_00000 | TERMINATED | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 | 1.54798 |     0.4412 |                    2 |
| train_cifar_6d290_00001 | TERMINATED | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   | 2.31084 |     0.0977 |                    1 |
| train_cifar_6d290_00002 | TERMINATED | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  | 1.23474 |     0.5812 |                   10 |
| train_cifar_6d290_00003 | TERMINATED | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  | 1.33464 |     0.535  |                    4 |
| train_cifar_6d290_00004 | TERMINATED | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 | 1.54346 |     0.4175 |                    1 |
| train_cifar_6d290_00005 | TERMINATED | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    | 2.31451 |     0.094  |                    1 |
| train_cifar_6d290_00006 | TERMINATED | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  | 1.60344 |     0.4197 |                    1 |
| train_cifar_6d290_00007 | TERMINATED | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 | 2.16975 |     0.1765 |                    2 |
| train_cifar_6d290_00009 | TERMINATED | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 | 2.29824 |     0.151  |                    1 |
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+


(func pid=2874) [4,  4000] loss: 0.647
(func pid=2874) [4,  6000] loss: 0.430
== Status ==
Current time: 2024-03-02 04:45:39 (running for 00:02:04.97)
Memory usage on this node: 9.5/503.5 GiB
Using AsyncHyperBand: num_stopped=9
Bracket: Iter 8.000: -1.316029909992218 | Iter 4.000: -1.3178756583213804 | Iter 2.000: -1.3893543909072876 | Iter 1.000: -1.5734497786038555
Resources requested: 2.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (1 RUNNING, 9 TERMINATED)
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name              | status     | loc             |   batch_size |   l1 |   l2 |          lr |    loss |   accuracy |   training_iteration |
|-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_6d290_00008 | RUNNING    | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 | 1.36272 |     0.5084 |                    3 |
| train_cifar_6d290_00000 | TERMINATED | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 | 1.54798 |     0.4412 |                    2 |
| train_cifar_6d290_00001 | TERMINATED | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   | 2.31084 |     0.0977 |                    1 |
| train_cifar_6d290_00002 | TERMINATED | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  | 1.23474 |     0.5812 |                   10 |
| train_cifar_6d290_00003 | TERMINATED | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  | 1.33464 |     0.535  |                    4 |
| train_cifar_6d290_00004 | TERMINATED | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 | 1.54346 |     0.4175 |                    1 |
| train_cifar_6d290_00005 | TERMINATED | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    | 2.31451 |     0.094  |                    1 |
| train_cifar_6d290_00006 | TERMINATED | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  | 1.60344 |     0.4197 |                    1 |
| train_cifar_6d290_00007 | TERMINATED | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 | 2.16975 |     0.1765 |                    2 |
| train_cifar_6d290_00009 | TERMINATED | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 | 2.29824 |     0.151  |                    1 |
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+


(func pid=2874) [4,  8000] loss: 0.331
(func pid=2874) [4, 10000] loss: 0.266
== Status ==
Current time: 2024-03-02 04:45:44 (running for 00:02:09.97)
Memory usage on this node: 9.6/503.5 GiB
Using AsyncHyperBand: num_stopped=9
Bracket: Iter 8.000: -1.316029909992218 | Iter 4.000: -1.3178756583213804 | Iter 2.000: -1.3893543909072876 | Iter 1.000: -1.5734497786038555
Resources requested: 2.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (1 RUNNING, 9 TERMINATED)
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name              | status     | loc             |   batch_size |   l1 |   l2 |          lr |    loss |   accuracy |   training_iteration |
|-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_6d290_00008 | RUNNING    | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 | 1.36272 |     0.5084 |                    3 |
| train_cifar_6d290_00000 | TERMINATED | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 | 1.54798 |     0.4412 |                    2 |
| train_cifar_6d290_00001 | TERMINATED | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   | 2.31084 |     0.0977 |                    1 |
| train_cifar_6d290_00002 | TERMINATED | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  | 1.23474 |     0.5812 |                   10 |
| train_cifar_6d290_00003 | TERMINATED | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  | 1.33464 |     0.535  |                    4 |
| train_cifar_6d290_00004 | TERMINATED | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 | 1.54346 |     0.4175 |                    1 |
| train_cifar_6d290_00005 | TERMINATED | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    | 2.31451 |     0.094  |                    1 |
| train_cifar_6d290_00006 | TERMINATED | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  | 1.60344 |     0.4197 |                    1 |
| train_cifar_6d290_00007 | TERMINATED | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 | 2.16975 |     0.1765 |                    2 |
| train_cifar_6d290_00009 | TERMINATED | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 | 2.29824 |     0.151  |                    1 |
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+


(func pid=2874) [4, 12000] loss: 0.220
(func pid=2874) [4, 14000] loss: 0.187
== Status ==
Current time: 2024-03-02 04:45:49 (running for 00:02:14.98)
Memory usage on this node: 9.6/503.5 GiB
Using AsyncHyperBand: num_stopped=9
Bracket: Iter 8.000: -1.316029909992218 | Iter 4.000: -1.3178756583213804 | Iter 2.000: -1.3893543909072876 | Iter 1.000: -1.5734497786038555
Resources requested: 2.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (1 RUNNING, 9 TERMINATED)
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name              | status     | loc             |   batch_size |   l1 |   l2 |          lr |    loss |   accuracy |   training_iteration |
|-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_6d290_00008 | RUNNING    | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 | 1.36272 |     0.5084 |                    3 |
| train_cifar_6d290_00000 | TERMINATED | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 | 1.54798 |     0.4412 |                    2 |
| train_cifar_6d290_00001 | TERMINATED | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   | 2.31084 |     0.0977 |                    1 |
| train_cifar_6d290_00002 | TERMINATED | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  | 1.23474 |     0.5812 |                   10 |
| train_cifar_6d290_00003 | TERMINATED | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  | 1.33464 |     0.535  |                    4 |
| train_cifar_6d290_00004 | TERMINATED | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 | 1.54346 |     0.4175 |                    1 |
| train_cifar_6d290_00005 | TERMINATED | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    | 2.31451 |     0.094  |                    1 |
| train_cifar_6d290_00006 | TERMINATED | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  | 1.60344 |     0.4197 |                    1 |
| train_cifar_6d290_00007 | TERMINATED | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 | 2.16975 |     0.1765 |                    2 |
| train_cifar_6d290_00009 | TERMINATED | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 | 2.29824 |     0.151  |                    1 |
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+


(func pid=2874) [4, 16000] loss: 0.164
(func pid=2874) [4, 18000] loss: 0.142
== Status ==
Current time: 2024-03-02 04:45:54 (running for 00:02:19.98)
Memory usage on this node: 9.6/503.5 GiB
Using AsyncHyperBand: num_stopped=9
Bracket: Iter 8.000: -1.316029909992218 | Iter 4.000: -1.3178756583213804 | Iter 2.000: -1.3893543909072876 | Iter 1.000: -1.5734497786038555
Resources requested: 2.0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (1 RUNNING, 9 TERMINATED)
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name              | status     | loc             |   batch_size |   l1 |   l2 |          lr |    loss |   accuracy |   training_iteration |
|-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_6d290_00008 | RUNNING    | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 | 1.36272 |     0.5084 |                    3 |
| train_cifar_6d290_00000 | TERMINATED | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 | 1.54798 |     0.4412 |                    2 |
| train_cifar_6d290_00001 | TERMINATED | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   | 2.31084 |     0.0977 |                    1 |
| train_cifar_6d290_00002 | TERMINATED | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  | 1.23474 |     0.5812 |                   10 |
| train_cifar_6d290_00003 | TERMINATED | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  | 1.33464 |     0.535  |                    4 |
| train_cifar_6d290_00004 | TERMINATED | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 | 1.54346 |     0.4175 |                    1 |
| train_cifar_6d290_00005 | TERMINATED | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    | 2.31451 |     0.094  |                    1 |
| train_cifar_6d290_00006 | TERMINATED | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  | 1.60344 |     0.4197 |                    1 |
| train_cifar_6d290_00007 | TERMINATED | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 | 2.16975 |     0.1765 |                    2 |
| train_cifar_6d290_00009 | TERMINATED | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 | 2.29824 |     0.151  |                    1 |
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+


(func pid=2874) [4, 20000] loss: 0.127
Result for train_cifar_6d290_00008:
  accuracy: 0.5163
  date: 2024-03-02_04-45-58
  done: true
  experiment_id: a0a581b251fb48c7b1c0c0007bbaba1d
  hostname: d4ad428dd9ba
  iterations_since_restore: 4
  loss: 1.4185918609520887
  node_ip: 172.17.0.3
  pid: 2874
  should_checkpoint: true
  time_since_restore: 136.0467689037323
  time_this_iter_s: 29.417129278182983
  time_total_s: 136.0467689037323
  timestamp: 1709354758
  timesteps_since_restore: 0
  training_iteration: 4
  trial_id: 6d290_00008
  warmup_time: 0.002081632614135742

== Status ==
Current time: 2024-03-02 04:45:58 (running for 00:02:24.38)
Memory usage on this node: 9.4/503.5 GiB
Using AsyncHyperBand: num_stopped=10
Bracket: Iter 8.000: -1.316029909992218 | Iter 4.000: -1.3346445262908935 | Iter 2.000: -1.3893543909072876 | Iter 1.000: -1.5734497786038555
Resources requested: 0/32 CPUs, 0/4 GPUs, 0.0/481.66 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /workspace/ray_results/train_cifar_2024-03-02_04-43-34
Number of trials: 10/10 (10 TERMINATED)
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name              | status     | loc             |   batch_size |   l1 |   l2 |          lr |    loss |   accuracy |   training_iteration |
|-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------|
| train_cifar_6d290_00000 | TERMINATED | 172.17.0.3:2826 |            2 |    4 |  256 | 0.000668648 | 1.54798 |     0.4412 |                    2 |
| train_cifar_6d290_00001 | TERMINATED | 172.17.0.3:2860 |            2 |   32 |    4 | 0.0222309   | 2.31084 |     0.0977 |                    1 |
| train_cifar_6d290_00002 | TERMINATED | 172.17.0.3:2862 |           16 |   64 |   64 | 0.00828877  | 1.23474 |     0.5812 |                   10 |
| train_cifar_6d290_00003 | TERMINATED | 172.17.0.3:2864 |            4 |  128 |   64 | 0.00306595  | 1.33464 |     0.535  |                    4 |
| train_cifar_6d290_00004 | TERMINATED | 172.17.0.3:2865 |            2 |    4 |  128 | 0.000424441 | 1.54346 |     0.4175 |                    1 |
| train_cifar_6d290_00005 | TERMINATED | 172.17.0.3:2868 |            4 |  256 |   16 | 0.028402    | 2.31451 |     0.094  |                    1 |
| train_cifar_6d290_00006 | TERMINATED | 172.17.0.3:2870 |            2 |  256 |    4 | 0.00202904  | 1.60344 |     0.4197 |                    1 |
| train_cifar_6d290_00007 | TERMINATED | 172.17.0.3:2872 |           16 |    4 |   32 | 0.000212462 | 2.16975 |     0.1765 |                    2 |
| train_cifar_6d290_00008 | TERMINATED | 172.17.0.3:2874 |            2 |   16 |   16 | 0.000831999 | 1.41859 |     0.5163 |                    4 |
| train_cifar_6d290_00009 | TERMINATED | 172.17.0.3:2876 |            8 |   16 |   16 | 0.000102461 | 2.29824 |     0.151  |                    1 |
+-------------------------+------------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+


2024-03-02 04:45:58,580 INFO tune.py:747 -- Total run time: 144.54 seconds (144.38 seconds for the tuning loop).
Best trial config: {'l1': 64, 'l2': 64, 'lr': 0.008288769137470064, 'batch_size': 16}
Best trial final validation loss: 1.2347397193431855
Best trial final validation accuracy: 0.5812
Files already downloaded and verified
Files already downloaded and verified
Best trial test set accuracy: 0.5825

If you run the code, the output could look like this example:

Number of trials: 10 (10 TERMINATED)
+-----+------+------+-------------+--------------+---------+------------+--------------------+
| ... |   l1 |   l2 |          lr |   batch_size |    loss |   accuracy | training_iteration |
|-----+------+------+-------------+--------------+---------+------------+--------------------|
| ... |   64 |    4 | 0.00011629  |            2 | 1.87273 |     0.244  |                  2 |
| ... |   32 |   64 | 0.000339763 |            8 | 1.23603 |     0.567  |                  8 |
| ... |    8 |   16 | 0.00276249  |           16 | 1.1815  |     0.5836 |                 10 |
| ... |    4 |   64 | 0.000648721 |            4 | 1.31131 |     0.5224 |                  8 |
| ... |   32 |   16 | 0.000340753 |            8 | 1.26454 |     0.5444 |                  8 |
| ... |    8 |    4 | 0.000699775 |            8 | 1.99594 |     0.1983 |                  2 |
| ... |  256 |    8 | 0.0839654   |           16 | 2.3119  |     0.0993 |                  1 |
| ... |   16 |  128 | 0.0758154   |           16 | 2.33575 |     0.1327 |                  1 |
| ... |   16 |    8 | 0.0763312   |           16 | 2.31129 |     0.1042 |                  4 |
| ... |  128 |   16 | 0.000124903 |            4 | 2.26917 |     0.1945 |                  1 |
+-----+------+------+-------------+--------------+---------+------------+--------------------+


Best trial config: {'l1': 8, 'l2': 16, 'lr': 0.00276249, 'batch_size': 16, 'data_dir': '...'}
Best trial final validation loss: 1.181501
Best trial final validation accuracy: 0.5836
Best trial test set accuracy: 0.5806

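Most trials were stopped early to avoid wasting resources. The best-performing trial reached a validation accuracy of about 58%, and the test set confirms that figure (0.5806 test accuracy against 0.5836 on validation).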
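To make the reading of the example table concrete, the following minimal sketch copies its rows into plain Python and recomputes the two observations above. It rests on two assumptions that this section does not state: the "best" trial is the one with the lowest final validation loss, and each trial could run for at most 10 epochs (the hypothetical constant `MAX_EPOCHS`). The helper names `best_trial` and `stopped_early` are illustrative only and are not part of Ray Tune.

```python
# Rows copied from the example results table above, in column order:
# l1, l2, lr, batch_size, loss, accuracy, training_iteration
FIELDS = ("l1", "l2", "lr", "batch_size", "loss", "accuracy", "training_iteration")
ROWS = [
    (64, 4, 0.00011629, 2, 1.87273, 0.244, 2),
    (32, 64, 0.000339763, 8, 1.23603, 0.567, 8),
    (8, 16, 0.00276249, 16, 1.1815, 0.5836, 10),
    (4, 64, 0.000648721, 4, 1.31131, 0.5224, 8),
    (32, 16, 0.000340753, 8, 1.26454, 0.5444, 8),
    (8, 4, 0.000699775, 8, 1.99594, 0.1983, 2),
    (256, 8, 0.0839654, 16, 2.3119, 0.0993, 1),
    (16, 128, 0.0758154, 16, 2.33575, 0.1327, 1),
    (16, 8, 0.0763312, 16, 2.31129, 0.1042, 4),
    (128, 16, 0.000124903, 4, 2.26917, 0.1945, 1),
]

# Assumption: the epoch budget per trial was 10 (not stated in this section).
MAX_EPOCHS = 10

trials = [dict(zip(FIELDS, row)) for row in ROWS]


def best_trial(trials):
    """Return the trial with the lowest final validation loss (assumed criterion)."""
    return min(trials, key=lambda t: t["loss"])


def stopped_early(trials, max_epochs=MAX_EPOCHS):
    """Return the trials that reported fewer iterations than the epoch budget."""
    return [t for t in trials if t["training_iteration"] < max_epochs]


best = best_trial(trials)
early = stopped_early(trials)

print("Best config:", {k: best[k] for k in ("l1", "l2", "lr", "batch_size")})
print("Best validation accuracy:", best["accuracy"])
print("Trials stopped early:", len(early), "of", len(trials))
```

Under these assumptions the selected row matches the reported best trial config (`l1=8`, `l2=16`, `batch_size=16`), and only that trial used the full epoch budget.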

That's all! You can now tune the hyperparameters of your PyTorch model.

Total running time of the script: (2 minutes 36.529 seconds)

© Copyright 2018-2023, PyTorch & PyTorch Korea User Group.