The specification of potentially confounding baseline features or covariates is a crucial step in the design of prospective and retrospective clinical trials. Baseline features underpin the integrity of the study design, the validity of the results, and the generalizability of the findings. We introduce CTSuggest, an application leveraging large language models (LLMs) to generate baseline features as part of the clinical trial design process. Users first specify basic trial metadata, and CTSuggest then proposes appropriate features, each accompanied by an explanation. Users can create an entirely new trial or start with metadata from an existing trial on ClinicalTrials.gov. We validate the quality of CTSuggest’s baseline features on the CT-Pub benchmark dataset, drawn from clinical trial publications, evaluating with the “LLM-as-a-Judge” (LaaJ) framework provided by the CTBench benchmark. The results show that the feature suggestions generated by the state-of-the-art GPT-4o model meet or exceed the previously published CTBench results. We also demonstrate the promise of a smaller open-source Llama model. Additionally, we examine the reliability of LaaJ evaluation in this setting. Coherence checking revealed three distinct types of hallucination in the LaaJ’s evaluations, necessitating a postprocessing correction step that yielded lower but more accurate performance metrics. The hallucination rate provides a quantifiable coherence metric that can be used systematically to improve LaaJ reliability. Our findings underscore the challenges in developing reliable LLM evaluation methods in healthcare applications and demonstrate a potential framework for improving LaaJ systems.
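To make the coherence-checking idea concrete, the sketch below illustrates how a postprocessing step might validate LaaJ-reported feature matches and compute a hallucination rate. This is a minimal illustration, not the paper’s implementation: the function names, the `JudgeMatch` structure, and the three hallucination categories shown are assumptions chosen for exposition and may differ from the taxonomy used in the study.

```python
from dataclasses import dataclass

@dataclass
class JudgeMatch:
    """One match reported by the LLM judge: a suggested feature it claims
    corresponds to a feature in the reference (published) list."""
    reference_feature: str
    candidate_feature: str

def coherence_check(matches, reference_features, candidate_features):
    """Flag judge-reported matches that cannot be grounded in either input list.

    Returns (valid_matches, hallucination_counts). The three categories below
    are hypothetical examples of incoherent judge output.
    """
    ref = {f.strip().lower() for f in reference_features}
    cand = {f.strip().lower() for f in candidate_features}
    counts = {"missing_reference": 0, "missing_candidate": 0, "duplicate": 0}
    seen = set()
    valid = []
    for m in matches:
        r = m.reference_feature.strip().lower()
        c = m.candidate_feature.strip().lower()
        if r not in ref:
            counts["missing_reference"] += 1   # judge cites a reference feature that does not exist
        elif c not in cand:
            counts["missing_candidate"] += 1   # judge cites a suggestion the model never made
        elif (r, c) in seen:
            counts["duplicate"] += 1           # same match reported more than once
        else:
            seen.add((r, c))
            valid.append(m)
    return valid, counts

def hallucination_rate(matches, counts):
    """Fraction of judge-reported matches that failed the coherence check."""
    total = len(matches)
    return sum(counts.values()) / total if total else 0.0
```

Under this scheme, corrected performance metrics (e.g., precision and recall of the suggested features) would be recomputed from `valid_matches` only, while `hallucination_rate` serves as the quantifiable coherence metric referenced above.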