A00-240 Exam Format | Course Contents | Course Outline | Exam Syllabus | Exam Objectives

This exam is administered by SAS and Pearson VUE.

60 scored multiple-choice and short-answer questions.

(Must achieve score of 68 percent correct to pass)

In addition to the 60 scored items, there may be up to five unscored items.

Two hours to complete exam.

Use exam ID A00-240; required when registering with Pearson VUE.

ANOVA - 10%

Verify the assumptions of ANOVA

Analyze differences between population means using the GLM and TTEST procedures

Perform ANOVA post hoc test to evaluate treatment effect

Detect and analyze interactions between factors

Linear Regression - 20%

Fit a multiple linear regression model using the REG and GLM procedures

Analyze the output of the REG, PLM, and GLM procedures for multiple linear regression models

Use the REG or GLMSELECT procedure to perform model selection

Assess the validity of a given regression model through the use of diagnostic and residual analysis

Logistic Regression - 25%

Perform logistic regression with the LOGISTIC procedure

Optimize model performance through input selection

Interpret the output of the LOGISTIC procedure

Score new data sets using the LOGISTIC and PLM procedures

Prepare Inputs for Predictive Model Performance - 20%

Identify the potential challenges when preparing input data for a model

Use the DATA step to manipulate data with loops, arrays, conditional statements and functions

Improve the predictive power of categorical inputs

Screen variables for irrelevance and non-linear association using the CORR procedure

Screen variables for non-linearity using empirical logit plots

Measure Model Performance - 25%

Apply the principles of honest assessment to model performance measurement

Assess classifier performance using the confusion matrix

Model selection and validation using training and validation data

Create and interpret graphs (ROC, lift, and gains charts) for model comparison and selection

Establish effective decision cut-off values for scoring

ANOVA - 10%

Verify the assumptions of ANOVA

Explain the central limit theorem and when it must be applied

Examine the distribution of continuous variables (histogram, box-whisker, Q-Q plots)

Describe the effect of skewness on the normal distribution

Define H0, H1, Type I/II error, statistical power, p-value

Describe the effect of sample size on p-value and power

Interpret the results of hypothesis testing

Interpret histograms and normal probability charts

Draw conclusions about your data from histogram, box-whisker, and Q-Q plots

Identify the kinds of problems that may be present in the data (biased sample, outliers, extreme values)

For a given experiment, verify that the observations are independent

For a given experiment, verify the errors are normally distributed

Use the UNIVARIATE procedure to examine residuals

For a given experiment, verify all groups have equal response variance

Use the HOVTEST option of the MEANS statement in PROC GLM to assess response variance

Analyze differences between population means using the GLM and TTEST procedures

Use the GLM Procedure to perform ANOVA

o CLASS statement

o MODEL statement

o MEANS statement

o OUTPUT statement

Evaluate the null hypothesis using the output of the GLM procedure

Interpret the statistical output of the GLM procedure (variance derived from MSE, F value, p-value, R**2, Levene's test)

Interpret the graphical output of the GLM procedure

Use the TTEST Procedure to compare means

Perform ANOVA post hoc test to evaluate treatment effect

Use the LSMEANS statement in the GLM or PLM procedure to perform pairwise comparisons

Use PDIFF option of LSMEANS statement

Use ADJUST option of the LSMEANS statement (TUKEY and DUNNETT)

Interpret diffograms to evaluate pairwise comparisons

Interpret control plots to evaluate pairwise comparisons

Compare/Contrast use of pairwise T-Tests, Tukey and Dunnett comparison methods

Detect and analyze interactions between factors

Use the GLM procedure to produce reports that will help determine the significance of the interaction between factors:

MODEL statement

LSMEANS with SLICE= option (also using PROC PLM)

ODS SELECT

Interpret the output of the GLM procedure to identify interaction between factors:

p-value

F Value

R Squared

TYPE I SS

TYPE III SS

Linear Regression - 20%

Fit a multiple linear regression model using the REG and GLM procedures

Use the REG procedure to fit a multiple linear regression model

Use the GLM procedure to fit a multiple linear regression model

Analyze the output of the REG, PLM, and GLM procedures for multiple linear regression models

Interpret REG or GLM procedure output for a multiple linear regression model:

Convert models to algebraic expressions

Identify missing degrees of freedom

Identify variance due to model/error, and total variance

Calculate a missing F value

Identify variable with largest impact to model

For output from two models, identify which model is better

Identify how much of the variation in the dependent variable is explained by the model

Conclusions that can be drawn from REG, GLM, or PLM output: (about H0, model quality, graphics)
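
The table-reading objectives above (missing degrees of freedom, a missing F value, variation explained by the model) rest on a few identities. A minimal Python sketch, assuming made-up sums of squares rather than real REG or GLM output:

```python
# Hypothetical ANOVA table values (not taken from any real REG/GLM output).
n_obs = 50          # number of observations
n_predictors = 3    # number of predictors in the model
ss_model = 120.0    # sum of squares due to the model
ss_total = 300.0    # corrected total sum of squares

# Degrees of freedom: model = p, error = n - p - 1, corrected total = n - 1.
df_model = n_predictors
df_error = n_obs - n_predictors - 1
df_total = n_obs - 1

# The error sum of squares is whatever the model leaves unexplained.
ss_error = ss_total - ss_model

# Mean squares and the overall F value.
ms_model = ss_model / df_model
ms_error = ss_error / df_error
f_value = ms_model / ms_error

# R-square: share of the variation in the response explained by the model.
r_square = ss_model / ss_total

print(df_model, df_error, df_total)
print(round(f_value, 3), round(r_square, 3))
```

Any one missing cell in the table can be recovered from the others because the degrees of freedom and the sums of squares each add up to their totals.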

Use the REG or GLMSELECT procedure to perform model selection

Use the SELECTION option of the model statement in the GLMSELECT procedure

Compare the different model selection methods (STEPWISE, FORWARD, BACKWARD)

Enable ODS graphics to display graphs from the REG or GLMSELECT procedure

Identify best models by examining the graphical output (fit criterion from the REG or GLMSELECT procedure)

Assign names to models in the REG procedure (multiple model statements)

Assess the validity of a given regression model through the use of diagnostic and residual analysis

Explain the assumptions for linear regression

From a set of residual plots, assess which assumption about the error terms has been violated

Use REG procedure MODEL statement options to identify influential observations (Student Residuals, Cook's D, DFFITS, DFBETAS)

Explain options for handling influential observations

Identify collinearity problems by examining REG procedure output

Use MODEL statement options to diagnose collinearity problems (VIF, COLLIN, COLLINOINT)

Logistic Regression - 25%

Perform logistic regression with the LOGISTIC procedure

Identify experiments that require analysis via logistic regression

Identify logistic regression assumptions

Explain logistic regression concepts (log odds, logit transformation, sigmoidal relationship between p and X)

Use the LOGISTIC procedure to fit a binary logistic regression model (MODEL and CLASS statements)
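
The concepts listed above (log odds, logit transformation, sigmoidal relationship between p and X) can be sketched in a few lines of Python. The coefficients b0 and b1 are hypothetical values chosen for illustration, not estimates from any data set:

```python
import math

def logit(p):
    """Log odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """Map a linear predictor back to a probability (the sigmoid curve)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical one-input model: logit(p) = b0 + b1 * x
b0, b1 = -2.0, 0.5

for x in (0, 4, 8):
    eta = b0 + b1 * x              # linear on the log-odds scale
    p = inverse_logit(eta)         # S-shaped on the probability scale
    print(x, round(eta, 2), round(p, 3))

# A one-unit increase in x multiplies the odds by exp(b1), the odds ratio.
odds_ratio = math.exp(b1)
print(round(odds_ratio, 3))
```

The model is linear in the log odds, which is why the relationship between p and X is sigmoidal rather than a straight line.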

Optimize model performance through input selection

Use the LOGISTIC procedure to fit a multiple logistic regression model

LOGISTIC procedure SELECTION=SCORE option

Perform Model Selection (STEPWISE, FORWARD, BACKWARD) within the LOGISTIC procedure

Interpret the output of the LOGISTIC procedure

Interpret the output from the LOGISTIC procedure for binary logistic regression models:

Model Convergence section

Testing Global Null Hypothesis table

Type 3 Analysis of Effects table

Analysis of Maximum Likelihood Estimates table

Association of Predicted Probabilities and Observed Responses

Score new data sets using the LOGISTIC and PLM procedures

Use the SCORE statement in the PLM procedure to score new cases

Use the CODE statement in PROC LOGISTIC to score new data

Describe when you would use the SCORE statement vs the CODE statement in PROC LOGISTIC

Use the INMODEL/OUTMODEL options in PROC LOGISTIC

Explain how to score new data when you have developed a model from a biased sample
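
For the last objective, one common approach is to correct each predicted probability for the gap between the event proportion in the biased sample and the event rate in the population. A hedged Python sketch of that arithmetic follows; the names rho1 (event proportion in the oversampled data) and pi1 (population event rate) are labels chosen here for illustration:

```python
def adjust_probability(p_sample, rho1, pi1):
    """Correct a probability from a model fit on an oversampled data set.

    p_sample : predicted event probability from the biased-sample model
    rho1     : proportion of events in the biased sample (assumed known)
    pi1      : proportion of events in the population (assumed known)
    """
    rho0 = 1.0 - rho1
    pi0 = 1.0 - pi1
    event_part = p_sample * pi1 / rho1
    nonevent_part = (1.0 - p_sample) * pi0 / rho0
    return event_part / (event_part + nonevent_part)

# Hypothetical case: 50/50 oversampled training data, 3% events in the population.
for p in (0.2, 0.5, 0.9):
    print(p, round(adjust_probability(p, rho1=0.5, pi1=0.03), 4))
```

The correction is equivalent to adding a constant offset to the intercept on the logit scale, so it changes the probabilities but not the rank order of the scored cases.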

Prepare Inputs for Predictive Model Performance - 20%

Identify the potential challenges when preparing input data for a model

Identify problems that missing values can cause in creating predictive models and scoring new data sets

Identify limitations of Complete Case Analysis

Explain problems caused by categorical variables with numerous levels

Discuss the problem of redundant variables

Discuss the problem of irrelevant and redundant variables

Discuss the non-linearities and the problems they create in predictive models

Discuss outliers and the problems they create in predictive models

Describe quasi-complete separation

Discuss the effect of interactions

Determine when it is necessary to oversample data

Use the DATA step to manipulate data with loops, arrays, conditional statements and functions

Use ARRAYs to create missing indicators

Use ARRAYS, LOOP, IF, and explicit OUTPUT statements

Improve the predictive power of categorical inputs

Reduce the number of levels of a categorical variable

Explain thresholding

Explain Greenacre's method

Cluster the levels of a categorical variable via Greenacre's method using the CLUSTER procedure

o METHOD=WARD option

o FREQ, VAR, ID statement

Use of ODS output to create an output data set

Convert categorical variables to continuous using smooth weight of evidence
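
As a rough illustration of the last objective, the sketch below computes a smoothed weight of evidence for each level of a categorical input. It assumes one commonly used form of the statistic, in which c pseudo-cases at the overall event rate are added to every level; the smoothing constant c and the level counts are hypothetical:

```python
import math

def smoothed_woe(events, nonevents, overall_rate, c=24.0):
    """Smoothed log odds for one level of a categorical input.

    c pseudo-observations at the overall event rate are added, so levels
    with few cases are pulled toward the overall log odds (assumed form).
    """
    return math.log((events + c * overall_rate) /
                    (nonevents + c * (1.0 - overall_rate)))

# Hypothetical (events, nonevents) counts per level.
levels = {"A": (30, 970), "B": (2, 8), "C": (0, 5)}
total_events = sum(e for e, _ in levels.values())
total_cases = sum(e + n for e, n in levels.values())
overall_rate = total_events / total_cases

for name, (e, n) in levels.items():
    print(name, round(smoothed_woe(e, n, overall_rate), 3))
```

Level C has no events, so its raw log odds would be undefined; the smoothing keeps the value finite and close to the overall log odds.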

Screen variables for irrelevance and non-linear association using the CORR procedure

Explain how Hoeffding's D and Spearman statistics can be used to find irrelevant variables and non-linear associations

Produce Spearman and Hoeffding's D statistic using the CORR procedure (VAR, WITH statement)

Interpret a scatter plot of Hoeffding's D and Spearman statistic to identify irrelevant variables and non-linear associations

Screen variables for non-linearity using empirical logit plots

Use the RANK procedure to bin continuous input variables (GROUPS=, OUT= option; VAR, RANK statements)

Interpret RANK procedure output

Use the MEANS procedure to calculate the sum and means for the target cases and total events (NWAY option; CLASS, VAR, OUTPUT statements)

Create empirical logit plots with the SGPLOT procedure

Interpret empirical logit plots
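
The binning and summarizing steps above feed a per-bin empirical logit. The Python sketch below shows the arithmetic only, assuming the usual small-sample adjustment of sqrt(n)/2 added to both the event and non-event counts; the bin counts are invented:

```python
import math

def empirical_logit(events, n_cases):
    """Empirical logit for one bin, with a sqrt(n)/2 adjustment (assumed form)."""
    adjustment = math.sqrt(n_cases) / 2.0
    return math.log((events + adjustment) /
                    (n_cases - events + adjustment))

# Hypothetical bins: (mean of the input in the bin, events, cases in the bin).
bins = [(10.0, 5, 100), (20.0, 12, 100), (30.0, 30, 100), (40.0, 33, 100)]

for mean_x, events, n_cases in bins:
    print(mean_x, round(empirical_logit(events, n_cases), 3))
```

Plotting these values against the bin means gives the empirical logit plot; a roughly straight line suggests the input can enter the model untransformed, while a bend or plateau suggests a transformation or binning.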

Measure Model Performance - 25%

Apply the principles of honest assessment to model performance measurement

Explain techniques to honestly assess classifier performance

Explain overfitting

Explain differences between validation and test data

Identify the impact of performing data preparation before data is split

Assess classifier performance using the confusion matrix

Explain the confusion matrix

Define: Accuracy, Error Rate, Sensitivity, Specificity, PV+, PV-

Explain the effect of oversampling on the confusion matrix

Adjust the confusion matrix for oversampling
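
The definitions and the oversampling adjustment above can be sketched as follows. The counts and the 3% population event rate are hypothetical; the adjustment assumes that sensitivity and specificity carry over from the oversampled data and only the class weights change:

```python
def adjust_for_oversampling(tp, fn, fp, tn, pi1):
    """Rescale confusion-matrix counts to population proportions.

    Sensitivity and specificity are computed within each actual class,
    so they are reused; the class weights become pi1 and 1 - pi1.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    pi0 = 1.0 - pi1
    return {
        "TP": pi1 * sensitivity,
        "FN": pi1 * (1.0 - sensitivity),
        "FP": pi0 * (1.0 - specificity),
        "TN": pi0 * specificity,
    }

# Hypothetical counts from a balanced (oversampled) validation set.
cells = adjust_for_oversampling(tp=80, fn=20, fp=30, tn=70, pi1=0.03)

accuracy = cells["TP"] + cells["TN"]
error_rate = cells["FP"] + cells["FN"]
pv_plus = cells["TP"] / (cells["TP"] + cells["FP"])
pv_minus = cells["TN"] / (cells["TN"] + cells["FN"])

print(round(accuracy, 3), round(error_rate, 3))
print(round(pv_plus, 4), round(pv_minus, 4))
```

Accuracy, error rate, PV+ and PV- all depend on the class mix, which is why they shift after the adjustment even though sensitivity and specificity do not.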

Model selection and validation using training and validation data

Divide data into training and validation data sets using the SURVEYSELECT procedure

Discuss the subset selection methods available in PROC LOGISTIC

Discuss methods to determine interactions (forward selection, with bar and @ notation)

Create interaction plot with the results from PROC LOGISTIC

Select the model with fit statistics (BIC, AIC, KS, Brier score)

Create and interpret graphs (ROC, lift, and gains charts) for model comparison and selection

Explain and interpret charts (ROC, Lift, Gains)

Create a ROC curve (OUTROC option of the SCORE statement in the LOGISTIC procedure)

Use the ROC and ROCCONTRAST statements to create an overlay plot of ROC curves for two or more models

Explain the concept of depth as it relates to the gains chart

Establish effective decision cut-off values for scoring

Illustrate a decision rule that maximizes the expected profit

Explain the profit matrix and how to use it to estimate the profit per scored customer

Calculate decision cutoffs using Bayes rule, given a profit matrix

Determine optimum cutoff values from profit plots

Given a profit matrix, and model results, determine the model with the highest average profit
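
A short Python sketch of the cutoff objectives above. It assumes a two-by-two profit matrix with hypothetical entries and derives the cutoff by comparing the expected profit of acting on a case with the expected profit of not acting:

```python
def bayes_cutoff(profit_tp, profit_fp, profit_fn, profit_tn):
    """Probability above which acting on a case has the higher expected profit.

    Act when p*profit_tp + (1-p)*profit_fp > p*profit_fn + (1-p)*profit_tn.
    Assumes acting on a true event pays more than ignoring it, and that
    acting on a non-event pays less than ignoring it.
    """
    gain_if_event = profit_tp - profit_fn
    loss_if_nonevent = profit_tn - profit_fp
    return loss_if_nonevent / (loss_if_nonevent + gain_if_event)

def average_profit(probabilities, outcomes, cutoff,
                   profit_tp, profit_fp, profit_fn, profit_tn):
    """Average profit per scored case under the rule 'act if p > cutoff'."""
    total = 0.0
    for p, y in zip(probabilities, outcomes):
        if p > cutoff:
            total += profit_tp if y == 1 else profit_fp
        else:
            total += profit_fn if y == 1 else profit_tn
    return total / len(probabilities)

# Hypothetical profit matrix: a responder is worth 99, a wasted contact costs 1.
cutoff = bayes_cutoff(profit_tp=99, profit_fp=-1, profit_fn=0, profit_tn=0)
profit = average_profit([0.005, 0.02, 0.4, 0.008], [0, 0, 1, 0], cutoff,
                        profit_tp=99, profit_fp=-1, profit_fn=0, profit_tn=0)
print(cutoff, profit)
```

To compare models given the same profit matrix, compute the average profit for each model's scored probabilities and pick the larger value.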

Question #87

What is a benefit to performing data cleansing (imputation, transformations, etc.) on data after partitioning the data for honest assessment as opposed to performing the data cleansing prior to partitioning the data?

A. It makes inference on the model possible.

B. It is computationally easier and requires less time.

C. It omits the training (and test) data sets from the benefits of the cleansing methods.

D. It allows for the determination of the effectiveness of the cleansing method.

Answer: D

Question #88

A researcher has several variables that could be possible predictors for the final model. There is interest in checking all 2-way interactions for possible entry to the model. The researcher has decided to use forward selection within PROC LOGISTIC. Fill in the missing code option that will ensure that all 2-way interactions will be considered for entry.

A. start = 5

B. include = 4

C. include = 5

D. start = 4

Answer: C

Question #89

FILL BLANK -

Refer to the confusion matrix:

An analyst determines that loan defaults occur at the rate of 3% in the overall population. The above confusion matrix is from an oversampled test set (1 = default).

What is the sensitivity adjusted for the population event probability?

Enter your answer in the space below. Round to three decimals (example: n.nnn).

Answer: 0.617

Question #90

Refer to the exhibit:

On the Gains Chart, what is the correct interpretation of the horizontal reference line?

A. the proportion of cases that cannot be classified

B. the probability of a false negative

C. the probability of a false positive

D. the prior event rate

Answer: D

Question #91

Refer to the confusion matrix:

Calculate the accuracy and error rate (0 - negative outcome, 1 - positive outcome)

A. Accuracy = 58/102, Error Rate = 23/48

B. Accuracy = 83/102, Error Rate = 67/102

C. Accuracy = 25/150, Error Rate = 44/150

D. Accuracy = 83/150, Error Rate = 67/150

Answer: A

Question #92

Which statistic is based on the maximum vertical distance between the primary event EDF and the secondary event EDF?

A. KS

B. SBC

C. Max EDF

D. Brier Score

Answer: A

Reference:

https://support.sas.com/documentation/onlinedoc/ets/132/severity.pdf
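
To make the definition in the question concrete, here is a small Python sketch that computes the KS statistic as the largest vertical gap between the empirical distribution functions (EDFs) of the scores for events and non-events. The score lists are invented:

```python
def ks_statistic(event_scores, nonevent_scores):
    """Maximum vertical distance between the two empirical distribution functions."""
    n_events = len(event_scores)
    n_nonevents = len(nonevent_scores)
    largest_gap = 0.0
    # The EDFs only change at observed scores, so checking those is enough.
    for t in sorted(set(event_scores) | set(nonevent_scores)):
        edf_events = sum(s <= t for s in event_scores) / n_events
        edf_nonevents = sum(s <= t for s in nonevent_scores) / n_nonevents
        largest_gap = max(largest_gap, abs(edf_events - edf_nonevents))
    return largest_gap

# Hypothetical predicted probabilities for actual events and non-events.
events = [0.6, 0.7, 0.8, 0.9]
nonevents = [0.1, 0.2, 0.3, 0.65]
print(ks_statistic(events, nonevents))
```

A value near 1 means the model's scores separate the two classes almost perfectly; a value near 0 means the two score distributions are nearly identical.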

Question #93

DRAG DROP -

Drag the adjustment formulas for oversampling from the left and place them into the correct location in the confusion matrix shown on the right.

Select and Place:

Answer:

Question #94

An analyst knows that the categorical predictor, zip_code, is an important predictor of a binary target. However, zip_code has too many levels to be a feasible predictor in a model. The analyst uses PROC CLUSTER to implement Greenacre's method to reduce the number of categorical levels.

What is the correct application of Greenacre's method in this situation?

A. Clustering the levels using the target proportion for each zip_code as input.

B. Clustering the levels using the zip_code values as input.

C. Clustering the levels using the number of cases in each zip_code as input.

D. Clustering the levels using dummy coded zip_code levels as inputs.

Answer: A

Reference:

https://support.sas.com/resources/papers/proceedings/proceedings/sugi31/079-31.pdf

Question #95

What does the Pearson product moment correlation coefficient measure?

A. nonlinear and nonmonotonic association between two variables

B. linear and monotonic association between two variables

C. linear and nonmonotonic association between two variables

D. nonlinear and monotonic association between two variables

Answer: B

Reference:

http://d-scholarship.pitt.edu/8056/1/Chokns_etd2010.pdf
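
The distinction the question draws can be seen on a small invented example: for a relationship that is monotonic but not linear, the rank-based Spearman statistic reaches 1 while the Pearson coefficient stays below 1. The sketch below uses hand-written helpers and assumes the data contain no ties:

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n of the values (this simple version assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]   # monotonic but clearly non-linear

print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```

Screening with both a linear measure and a rank-based or more general measure is what lets an analyst tell an irrelevant input from one whose association is simply not linear.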

Question #96

This question will ask you to provide a segment of missing code.

The following code is used to create missing value indicator variables for input variables, fred1 to fred7.

Which segment of code would complete the task?

A.

B.

C.

D.

Answer: C

Question #97

This question will ask you to provide a missing option.

Given the following SAS program:

What option must be added to the program to obtain a data set containing Spearman statistics?

A. OUTCORR=estimates

B. OUTS=estimates

C. OUT=estimates

D. OUTPUT=estimates

Answer: B

Question #98

This question will ask you to provide a missing option.

A business analyst is investigating the differences in sales figures across 8 sales regions. The analyst is interested in viewing the regression equation parameter estimates for each of the design variables.

Which option completes the program to produce the regression equation parameter estimates?

A. Solve

B. Estimate

C. Solution

D. Est

Answer: C

Reference:

https://documentation.sas.com/?docsetId=statug&docsetTarget=statug_ods_examples06.htm&docsetVersion=14.3&locale=en

Question #99

After performing an ANOVA test, an analyst has determined that a significant effect exists due to income. The analyst wants to compare each Income to all others and wants to control for experimentwise error.

Which GLM procedure statement would provide the most appropriate output?

A. lsmeans Income / pdiff=control adjust=dunnett;

B. lsmeans Income / pdiff=control adjust=t;

C. lsmeans Income / pdiff=all adjust=tukey;

D. lsmeans Income / pdiff=all adjust=t;

Answer: C

Reference:

https://rpubs.com/JsoLab/Stat01_L02

Question #100

SIMULATION -

A linear model has the following characteristics:

*A dependent variable (y)

*One continuous variable (x1), including a quadratic term (x1**2)

*One categorical (d with 3 levels) predictor variable and an interaction term (d by x1)

How many parameters, including the intercept, are associated with this model?

Enter your numeric answer in the space below. Do not add leading or trailing spaces to your answer.

Answer: 7
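
The count can be checked term by term. The sketch below assumes full-rank (reference) coding, so the 3-level categorical variable d contributes 2 design variables rather than 3:

```python
levels_of_d = 3

# Number of estimated parameters contributed by each term in the model.
parameters = {
    "intercept": 1,
    "x1": 1,
    "x1**2": 1,
    "d": levels_of_d - 1,            # two design variables for three levels
    "d*x1": (levels_of_d - 1) * 1,   # one slope adjustment per non-reference level
}

total_parameters = sum(parameters.values())
print(total_parameters)  # 7
```

Under an over-parameterized coding, a parameter table may list an extra row per class effect for the reference level, set to zero; those rows are not additional estimated parameters.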
