UE4 Redundant Asset Detection Method Based on Pointer Analysis

Tingzhen Liu

doi:10.20944/preprints202409.1236.v1

Submitted: 14 September 2024

Posted: 18 September 2024


Abstract

During game development, assets are abandoned as content iterates, and identifying and deleting them can effectively reduce the size of game packages. However, the built-in reference checking tool in UE4 only checks static references between assets and cannot identify assets that are referenced dynamically from program code. To address this, we developed a static analysis tool that finds unused assets in a UE4 project. The tool inspects the function arguments at every loading point, analyzes the range of values the argument string can take based on the data dependencies of the argument variable, and treats every asset that matches within that range as referenced. Because in some cases the exact data flow of the argument variable is not computable (resulting in false negatives), the reverse analysis tool also supports manually marking the parameter range at a loading point. The tool outputs the set of dynamically referenced assets. The union of this set with the result of the built-in UE4 reference checking tool is the set of all referenced assets, and the difference between the set of all assets and the set of referenced assets is the set of abandoned assets. This yields a more complete cleanup of redundant resources and a smaller game package.

Keywords: program analysis; static analysis; pointer analysis; data flow analysis; flow sensitivity; game development

Subject: Computer Science and Mathematics - Software

Method

Our static analyzer checks the function arguments at all WBP loading points. It analyzes the value range of each argument variable based on that variable's data dependencies [1], approximating as closely as possible the set of literal values that the variable may receive at the loading point. To achieve this, we use a two-stage approach: the first stage collects the set of all variables that the argument variable depends on (when it is assigned from other variables), and the second stage collects the literal assignments to every variable in that set.

Step 1: The symbols tracked by the algorithm are divided into three categories: ordinary variables, array types, and record types. The analysis starts from the use of symbol $s$ and constructs a data flow graph containing all symbols, of all three categories, that $s$ depends on.

Suppose there is a symbol $a$ in the data flow graph $G$. The rules for adding elements to $G$ for array types are:

$$\frac{\Gamma \vdash a \qquad \Gamma \vdash b : \mathrm{Array} \qquad \Gamma \vdash (a := b[M])}{\text{add } a \to b \text{ to } G}\ \text{ArrayEdgeAddA}$$

$$\frac{\Gamma \vdash a : \mathrm{Array} \qquad \Gamma \vdash b \qquad \Gamma \vdash (a.\mathrm{insert}(b))}{\text{add } a \to b \text{ to } G}\ \text{ArrayEdgeAddB}$$

The rule for adding elements to $G$ for record types is:

$$\frac{\Gamma \vdash a \qquad \Gamma \vdash b : \mathrm{Record} \qquad \Gamma \vdash (a := b.c)}{\text{add } a \to b.c \text{ to } G}\ \text{RecordEdgeAdd}$$

Both $b$ and $b.c$ are considered elements of $G$.

For any symbol, the following rule applies:

$$\frac{\Gamma \vdash a \qquad \Gamma \vdash b \qquad \Gamma \vdash (a := b)}{\text{add } a \to b \text{ to } G}\ \text{BasicEdgeAdd}$$

The algorithm first adds the symbol $s$ to $G$ and then analyzes the code upwards, applying the three types of rules above until $G$ can no longer be expanded.
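The fixed-point construction of Step 1 can be sketched as a worklist loop. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the tuple-based statement format, the name `build_graph`, and the extra link from `b.c` to `b` are all hypothetical choices made for the example.

```python
# Toy statement format (an assumption for illustration): (kind, target, source)
#   ("assign", "a", "b")    a := b        -> BasicEdgeAdd
#   ("index",  "a", "b")    a := b[M]     -> ArrayEdgeAddA
#   ("insert", "a", "b")    a.insert(b)   -> ArrayEdgeAddB
#   ("field",  "a", "b.c")  a := b.c      -> RecordEdgeAdd
EDGE_KINDS = {"assign", "index", "insert", "field"}


def build_graph(statements, root):
    """Grow the data flow graph G from `root` until no rule adds anything new."""
    graph = {root: set()}   # symbol -> set of symbols it depends on
    worklist = [root]       # symbols whose modification points are not scanned yet

    def add_edge(src, dst):
        graph[src].add(dst)
        if dst not in graph:        # newly discovered symbol: scan it later
            graph[dst] = set()
            worklist.append(dst)

    while worklist:
        sym = worklist.pop()
        for kind, target, source in statements:
            if kind not in EDGE_KINDS or target != sym:
                continue
            add_edge(sym, source)
            if kind == "field":
                # Both b.c and b are elements of G; linking b.c to b is our
                # own assumption so that changes to the whole record are found.
                add_edge(source, source.rsplit(".", 1)[0])
    return graph


statements = [
    ("assign", "s", "x"),
    ("index", "x", "names"),
    ("insert", "names", "y"),
    ("field", "y", "cfg.name"),
    ("assign", "unrelated", "z"),   # never reached from s
]
graph = build_graph(statements, "s")
```

Each symbol enters the worklist only once, when it is first discovered, so the loop terminates as soon as no rule contributes a new node. Statements that never feed into `s` (here `unrelated := z`) are not added to the graph.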

For function calls, the following rule applies:

$$\frac{\begin{matrix} \Gamma \vdash a \qquad \Gamma \vdash F(a, \dots) \\ F(A) \vdash b \qquad F(a, \dots) \vdash (a := b) \end{matrix}}{\begin{matrix} \mathrm{DoEdgeAdd}(F(A) \text{ as } \Gamma,\ G' \text{ as } G) \\ \text{add } a \underset{F(A_{\mathrm{Argument}}) \text{ by CallA}}{\to} A_{\mathrm{Argument}} / a \text{ to } G \end{matrix}}\ \text{CallEdgeAddA}$$

Note: $F(a, \dots)$ is equivalent to $F(A_{\mathrm{Argument}})$, which represents the actual call point, where $a \in A_{\mathrm{Argument}}$.

This rule states that if symbol $a$ is modified inside function $F$ and the modification depends on symbol $b$ in that function, then a separate data flow graph $G'$ is constructed for $F$, and an edge from symbol $a$ to the other arguments passed to $F$ is added to the original data flow graph $G$. This edge carries the call point information, that is, the correspondence between these arguments and the formal parameters of $F$.

The rule for a symbol $a$ that is modified through a return value is:

$$\frac{\Gamma \vdash a \qquad \Gamma \vdash (a := F(A_{\mathrm{Argument}}))}{\begin{matrix} \mathrm{DoEdgeAdd}(F(A) \text{ as } \Gamma,\ G' \text{ as } G) \\ \text{add } a \underset{F(A_{\mathrm{Argument}}) \text{ by CallB}}{\to} A_{\mathrm{Argument}} \text{ to } G \end{matrix}}\ \text{CallEdgeAddB}$$

This rule adds to the original data flow graph $G$ an edge from symbol $a$ to all arguments passed to $F$ (assuming $a \notin A_{\mathrm{Argument}}$).

Afterwards, the algorithm replaces the CallA and CallB edges in $G$ with the data flow graph $G'$ of $F$:

$$\frac{\forall e : \text{Edge by CallA in } G}{\mathrm{DoReplace}\ e \text{ to } G'(A_{\mathrm{Argument}} \text{ as } A)}\ \text{CallAEdgeReplace}$$

This rule replaces $A$ (the formal parameter nodes) in $G'$ with the argument nodes recorded in edge $e$ (i.e. $A_{\mathrm{Argument}}$ in $a \underset{F(A_{\mathrm{Argument}}) \text{ by CallA}}{\to} A_{\mathrm{Argument}} / a$). This allows $G'$ to connect with the existing nodes in $G$.

$$\frac{\forall e : \text{Edge by CallB in } G \qquad F(A) \vdash (\mathrm{return}\ r)}{\mathrm{DoReplace}\ e \text{ to } G'(A_{\mathrm{Argument}} \text{ as } A,\ a \text{ as } r)}\ \text{CallBEdgeReplace}$$

In this rule, since symbol $a$ is modified by the return value of $F$, the node of the returned symbol $r$ in $G'$ must also be replaced with symbol $a$. This allows $G'$ to connect with the existing nodes in $G$.
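The two replacement rules can be illustrated with a small sketch that splices a callee graph into the caller graph. It is only an approximation under assumptions: the dictionary layout of `callees`, the tuple layout of `call_edges`, the function name `pick_name`, and the `func::` prefix used to keep callee-local symbols apart from caller symbols are all hypothetical and not taken from the tool.

```python
def rename(callee_graph, mapping, prefix):
    """Copy G', substituting mapped symbols and prefixing callee-local ones."""
    def new_name(sym):
        return mapping.get(sym, f"{prefix}::{sym}")

    renamed = {}
    for node, deps in callee_graph.items():
        renamed.setdefault(new_name(node), set()).update(new_name(d) for d in deps)
    return renamed


def replace_call_edges(graph, call_edges, callees):
    """Splice each callee graph G' into G in place of a CallA/CallB edge."""
    merged = {node: set(deps) for node, deps in graph.items()}
    for target, func, arguments, kind in call_edges:
        callee = callees[func]
        # Call point information: formal parameters A <-> arguments A_Argument.
        mapping = dict(zip(callee["formals"], arguments))
        if kind == "CallB":
            mapping[callee["returns"]] = target   # "a as r"
        for node, deps in rename(callee["graph"], mapping, func).items():
            merged.setdefault(node, set()).update(deps)
            for dep in deps:
                merged.setdefault(dep, set())
    return merged


# Hypothetical callee: r depends on a local tmp, which depends on formal p.
callees = {
    "pick_name": {
        "formals": ["p"],
        "returns": "r",
        "graph": {"r": {"tmp"}, "tmp": {"p"}, "p": set()},
    }
}
graph = {"a": set(), "arg": set()}
call_edges = [("a", "pick_name", ["arg"], "CallB")]   # a := pick_name(arg)
merged = replace_call_edges(graph, call_edges, callees)
```

After the splice, `a` depends on the callee-local `pick_name::tmp`, which in turn depends on the caller's `arg`, so the callee graph is connected to the existing nodes of the caller graph. Keeping call edges in a separate list, rather than mixing them with ordinary edges, is a design choice of this sketch that avoids having to delete placeholder edges afterwards.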

Step 2: The algorithm collects every literal value assigned to a symbol in $G$:

$$\frac{G \vdash root \qquad G \vdash a \qquad G \vdash (root \rightharpoonup a) \qquad \Gamma \vdash (a := M)}{\text{add } M \text{ to } R}$$

Here, $root$ is the starting point of the analysis, and $root \rightharpoonup a$ denotes that $root$ is connected to $a$. $R$ is the set of possible literal values that the analyzed symbol $root$ may receive.
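Step 2 amounts to a reachability search followed by a union of literal sets. The sketch below assumes the graph is a dictionary from each symbol to the symbols it depends on, and that the root counts as connected to itself. The intermediate symbol names (`DefaultBp`, `OptBp`, `SkillBp`, `Unrelated`) and the literal `WBP_NeverLoaded` are hypothetical; the other three literals are the values from the example in this paper.

```python
def collect_literals(graph, literal_assignments, root):
    """Return R: all literals assigned to symbols connected to `root`."""
    seen = {root}          # assumption: the root is connected to itself
    stack = [root]
    while stack:
        sym = stack.pop()
        for dep in graph.get(sym, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    result = set()
    for sym in seen:
        result |= literal_assignments.get(sym, set())
    return result


graph = {
    "BpName": {"DefaultBp", "OptBp"},
    "DefaultBp": set(),
    "OptBp": {"SkillBp"},
    "SkillBp": set(),
}
literal_assignments = {
    "DefaultBp": {"WBP_CommonSelect"},
    "OptBp": {"WBP_ShareMenuPopup"},
    "SkillBp": {"WBP_SkillInfoIntensify"},
    "Unrelated": {"WBP_NeverLoaded"},   # not connected to the root
}
R = collect_literals(graph, literal_assignments, "BpName")
```

Literals assigned to symbols that are not connected to the root, such as `Unrelated` here, do not enter `R`.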

Example

Consider the following code snippet: [figure: Preprints 118236 g001]

The argument BpName of the loading point LoadComponent is the symbol to be analyzed. Tracing backwards, BpName has two modification points, both of which are assignment statements. Therefore, the data flow graph $G$ becomes:

After scanning all the modification points of BpName, the algorithm scans the newly added symbols in the data flow graph. Suppose the code contains: [figure: Preprints 118236 g003]

Then the data flow graph $G$ becomes:

Once all symbols have been scanned and no new symbols are added, the literal value ranges of all symbols are merged. The possible values of BpName are then WBP_CommonSelect, WBP_ShareMenuPopup, and WBP_SkillInfoIntensify.

This example shows that, because of the trade-off between the speed and the accuracy of the analysis algorithm, we must choose control flow and data flow abstractions that avoid intractable computations. To this end, our algorithm over-approximates the exact set of values that the parameter may take. Therefore, only flow-sensitive analysis methods [2] are used. In this example, we do not determine whether self.bIsOpt can ever be true (if it were always false, one value could be excluded); instead, we follow all possible paths and contexts that can enlarge the set of values.

For the analysis of function calls, consider the following code snippet: [figure: Preprints 118236 g005]

The function replaceInfo indirectly modifies self.CustomParamsInfo.BpName by modifying self.CustomParamsInfo. So, according to the rule CallEdgeAddB, the data flow graph is updated to: [figure: Preprints 118236 g006]

Then a separate data flow graph $G'$ is built for replaceInfo: [figure: Preprints 118236 g007]

The rule CallBEdgeReplace is then applied to replace the newly added nodes in $G$ with $G'$:

The algorithm then collects the modification points of the newly added symbol ReplaceInfo (Step 1) and the literal assignments to ReplaceInfo.BpName (Step 2). Suppose the code contains: [figure: Preprints 118236 g009]

Then the data flow graph $G$ becomes:

Some code involves string computations such as substitution and concatenation. For such code, existing static string analysis algorithms can infer a pattern describing the strings that a symbol may hold (for example, as automata [3,4,7,8] or regular expressions [5,6]). If any asset matches the pattern of the symbol, that asset is considered to be in use.
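As an illustration of the matching step only, the sketch below assumes that a string analysis has already produced a regular expression for the symbol (here a hypothetical pattern for a name built as "WBP_" plus an unknown middle part plus "Popup"). How such a pattern is derived is the job of the cited string analyses and is not shown. The function name `assets_matching` and the asset `WBP_ExitPopup` are hypothetical.

```python
import re


def assets_matching(patterns, asset_names):
    """Return the assets whose full name matches at least one pattern."""
    compiled = [re.compile(p) for p in patterns]
    return {
        name
        for name in asset_names
        if any(c.fullmatch(name) for c in compiled)
    }


assets = ["WBP_ShareMenuPopup", "WBP_CommonSelect", "WBP_ExitPopup"]
# Hypothetical pattern for the concatenation "WBP_" + <unknown> + "Popup".
used = assets_matching([r"WBP_\w*Popup"], assets)
```

Using `fullmatch` rather than `search` keeps the over-approximation from growing further: an asset is counted as used only when its whole name fits the pattern, not when the pattern appears somewhere inside a longer name.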

Conclusion

Our method uses asset loading points as roots and constructs the data flow graph in reverse. This determines which symbols need to be abstracted. Because only a small portion of the code is related to loading points, the large number of paths that do not interact with relevant symbols are never examined, which effectively reduces the cost of the analysis. In addition, because of the complexity of real-world programs, the precise data flow of an argument variable is in some cases not computable (resulting in false negatives), so our reverse analysis tool supports manually marking parameter ranges at the loading point.
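The final set computation described in the abstract can be written down directly. This is a minimal sketch: the function name `abandoned_assets`, the treatment of manual marks as one more set of referenced assets, and the asset `WBP_OldBanner` are assumptions made for illustration.

```python
def abandoned_assets(all_assets, static_refs, dynamic_refs, manual_marks=()):
    """Abandoned = all assets minus (static refs U dynamic refs U manual marks)."""
    referenced = set(static_refs) | set(dynamic_refs) | set(manual_marks)
    return set(all_assets) - referenced


all_assets = {
    "WBP_CommonSelect",
    "WBP_ShareMenuPopup",
    "WBP_SkillInfoIntensify",
    "WBP_OldBanner",          # hypothetical asset that nothing references
}
static_refs = {"WBP_CommonSelect"}                 # from the built-in checker
dynamic_refs = {"WBP_ShareMenuPopup"}              # from the data flow analysis
manual_marks = {"WBP_SkillInfoIntensify"}          # marked by hand at a loading point
abandoned = abandoned_assets(all_assets, static_refs, dynamic_refs, manual_marks)
```

Because the dynamic set is an over-approximation, an asset reported as abandoned is one that neither the static checker, the analysis, nor a manual mark could connect to any loading point.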

References

[1] Shun, N.A. Analysis Technology Research of Data Flow Oriented Java Program Pointer[J]. Computer Programming Skills & Maintenance 2014.
[2] Thiessen, R. Expression Data Flow Graph: Precise Flow-Sensitive Pointer Analysis for C Programs. University of Alberta, 2011.
[3] Gordon, M.I.; Kim, D.; Perkins, J.; et al. Information-Flow Analysis of Android Applications in DroidSafe[C]. Network & Distributed System Security Symposium; 2015.
[4] Arceri, V.; Mastroeni, I. Static Program Analysis for String Manipulation Languages. Electronic Proceedings in Theoretical Computer Science 2019, 299, 19–33.
[5] Negrini, L.; Arceri, V.; Ferrara, P.; et al. Twinning Automata and Regular Expressions for String Static Analysis[C]. Springer: Cham, 2021.
[6] Trinh, M.T.; Chu, D.H.; Jaffar, J. S3: A Symbolic String Solver for Vulnerability Detection in Web Applications. ACM 2014.
[7] Christensen, A.S.; Møller, A.; Schwartzbach, M.I. Precise Analysis of String Expressions. BRICS Report Series 2003, 10, 1.
[8] Shannon, D.; Ghosh, I.; Rajan, S.; Khurshid, S. Efficient symbolic execution of strings for validating web applications. In Proceedings of the 2nd International Workshop on Defects in Large Software Systems, held in conjunction with the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2009), DEFECTS '09, 2009; pp. 22–26.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
