Submitted: 02 July 2025
Posted: 03 July 2025
Abstract
Keywords:
1. Introduction
2. Related Works
3. Background
3.1. Ethereum and Solidity
3.2. Yul Intermediate Layer
Listing 1. Yul specification schema.

    Block = '{' Statement* '}'
    Statement =
        Block |
        FunctionDefinition |
        VariableDeclaration |
        Assignment |
        If |
        Expression |
        Switch |
        ForLoop |
        BreakContinue |
        Leave
    FunctionDefinition =
        'function' Identifier '(' TypedIdentifierList? ')'
        ( '->' TypedIdentifierList )? Block
    VariableDeclaration =
        'let' TypedIdentifierList ( ':=' Expression )?
    Assignment =
        IdentifierList ':=' Expression
    Expression =
        FunctionCall | Identifier | Literal
    If =
        'if' Expression Block
    Switch =
        'switch' Expression ( Case+ Default? | Default )
    Case =
        'case' Literal Block
    Default =
        'default' Block
    ForLoop =
        'for' Block Expression Block Block
    BreakContinue =
        'break' | 'continue'
    Leave = 'leave'
    FunctionCall =
        Identifier '(' ( Expression ( ',' Expression )* )? ')'
    Identifier = [a-zA-Z_$] [a-zA-Z_$0-9.]*
    IdentifierList = Identifier ( ',' Identifier)*
    TypeName = Identifier
    TypedIdentifierList = Identifier ( ':' TypeName )? ( ',' Identifier ( ':' TypeName )? )*
    Literal =
        (NumberLiteral | StringLiteral | TrueLiteral | FalseLiteral) ( ':' TypeName )?
    NumberLiteral = HexNumber | DecimalNumber
    StringLiteral = '"' ([^"\r\n\\] | '\\' .)* '"'
    TrueLiteral = 'true'
    FalseLiteral = 'false'
    HexNumber = '0x' [0-9a-fA-F]+
    DecimalNumber = [0-9]+
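To make the lexical part of the grammar in Listing 1 concrete, the following minimal sketch transcribes the `Identifier`, `HexNumber` and `DecimalNumber` productions into Python regular expressions. The helper name `classify_token` and its return labels are hypothetical and exist only for this illustration; they are not part of the Yul toolchain, and keywords such as `let` are deliberately not distinguished from identifiers here.

```python
import re

# Regular expressions transcribed from the lexical productions in Listing 1.
IDENTIFIER = re.compile(r"[a-zA-Z_$][a-zA-Z_$0-9.]*")
HEX_NUMBER = re.compile(r"0x[0-9a-fA-F]+")
DECIMAL_NUMBER = re.compile(r"[0-9]+")


def classify_token(token: str) -> str:
    """Return a coarse lexical class for one token (hypothetical helper)."""
    # TrueLiteral / FalseLiteral take precedence over Identifier.
    if token in ("true", "false"):
        return "boolean"
    if HEX_NUMBER.fullmatch(token):
        return "hex"
    if DECIMAL_NUMBER.fullmatch(token):
        return "decimal"
    if IDENTIFIER.fullmatch(token):
        return "identifier"
    return "unknown"


if __name__ == "__main__":
    for tok in ["mstore", "0x40", "128", "true", "abi.encode", "9lives"]:
        print(tok, "->", classify_token(tok))
```

Note that the `Identifier` production allows dots after the first character, so a name such as `abi.encode` is a single identifier rather than a member access.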
3.3. Importance of Yul Optimization
3.4. Representation Learning
4. Yul Code Vectorization
4.1. Vectorizing Atomic Yul Entities
4.2. Vectorizing Entire YUL Contracts
Listing 2. Triplet extraction from Yul code.

    function extract_arguments(ast):
        arguments = []
        for arg in ast:
            if arg["nodeType"] eq "YulFunctionCall":
                arguments += "function"
            else if arg["nodeType"] eq "YulIdentifier":
                arguments += "variable"
            else if arg["nodeType"] eq "YulLiteral":
                arguments += "constant"
            else:
                arguments += "unknown"
        return arguments

    function extract_function_call(ast):
        opcode_name = ast["functionName"]["name"]
        opcode_type = EVM_OPCODES[ast["functionName"]["name"]]
        if opcode_type is null:
            opcode_type = "unknownTy"
            opcode_name = "functioncall"
        append_triplet(opcode_name, opcode_type, extract_arguments(ast["arguments"]))

    function extract_triplets(ast):
        nodeType = ast["nodeType"]
        match nodeType:
            case "YulObject":
                extract_triplets(ast["code"])
            case "YulCode":
                extract_triplets(ast["block"])
            case "YulBlock":
                for statement in ast["statements"]:
                    extract_triplets(statement)
            case "YulFunctionCall":
                extract_function_call(ast)
            case "YulExpressionStatement":
                extract_triplets(ast["expression"])
            case "YulVariableDeclaration" | "YulAssignment":
                extract_triplets(ast["value"])
            case "YulFunctionDefinition":
                extract_triplets(ast["body"])
            case "YulIf":
                append_triplet("if", "booleanTy", extract_arguments([ast["condition"]]))
                extract_triplets(ast["body"])
            case "YulForLoop":
                append_triplet("forloop", "voidTy")
                extract_triplets(ast["pre"])
                extract_triplets(ast["condition"])
                extract_triplets(ast["post"])
                extract_triplets(ast["body"])
            case "YulSwitch":
                append_triplet("switch", "voidTy", extract_arguments([ast["expression"]]))
                for switch_case in ast["cases"]:
                    extract_triplets(switch_case["body"])
            case "YulBreak":
                append_triplet("break", "voidTy")
            case "YulContinue":
                append_triplet("continue", "voidTy")
            case "YulLeave":
                append_triplet("leave", "voidTy")
            case "YulIdentifier" | "YulLiteral":
                # do nothing

    for yul_code in yul_dataset:
        ast = parse_ast(yul_code)
        extract_triplets(ast)
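The extraction procedure of Listing 2 can be tried on a small hand-built AST. The sketch below is a runnable Python rendering of a subset of the node types (blocks, expression statements, declarations and assignments, function calls, and `if`). The opcode-to-type table `EVM_OPCODES` shown here is a hypothetical four-entry stand-in for the paper's full mapping, and the AST helper functions `call`, `ident` and `lit` are introduced only for this example.

```python
# Hypothetical opcode-to-type table; the full EVM_OPCODES mapping is not shown in the text.
EVM_OPCODES = {"add": "numberTy", "lt": "booleanTy", "mstore": "voidTy", "caller": "addressTy"}

ARG_KINDS = {"YulFunctionCall": "function", "YulIdentifier": "variable", "YulLiteral": "constant"}


def extract_arguments(args):
    """Map each argument node to its argument entity."""
    return [ARG_KINDS.get(arg["nodeType"], "unknown") for arg in args]


def extract_triplets(node, out):
    """Append (opcode, type, arguments) triplets for a subset of Yul node types."""
    if node is None:  # e.g. a declaration without an assigned value
        return out
    kind = node["nodeType"]
    if kind == "YulBlock":
        for statement in node["statements"]:
            extract_triplets(statement, out)
    elif kind == "YulExpressionStatement":
        extract_triplets(node["expression"], out)
    elif kind in ("YulVariableDeclaration", "YulAssignment"):
        extract_triplets(node.get("value"), out)
    elif kind == "YulFunctionCall":
        name = node["functionName"]["name"]
        op_type = EVM_OPCODES.get(name)
        if op_type is None:  # not an EVM opcode: treat as a generic function call
            name, op_type = "functioncall", "unknownTy"
        out.append((name, op_type, tuple(extract_arguments(node["arguments"]))))
    elif kind == "YulIf":
        out.append(("if", "booleanTy", tuple(extract_arguments([node["condition"]]))))
        extract_triplets(node["body"], out)
    # Identifiers and literals produce no triplet of their own.
    return out


# Small helpers for building AST nodes by hand (illustrative only).
def call(name, *args):
    return {"nodeType": "YulFunctionCall", "functionName": {"name": name}, "arguments": list(args)}


def ident(name):
    return {"nodeType": "YulIdentifier", "name": name}


def lit(value):
    return {"nodeType": "YulLiteral", "value": value}


# AST for: { let x := add(1, y)  if lt(x, 2) { mstore(0, x) } }
EXAMPLE_AST = {
    "nodeType": "YulBlock",
    "statements": [
        {"nodeType": "YulVariableDeclaration", "value": call("add", lit("1"), ident("y"))},
        {
            "nodeType": "YulIf",
            "condition": call("lt", ident("x"), lit("2")),
            "body": {
                "nodeType": "YulBlock",
                "statements": [
                    {"nodeType": "YulExpressionStatement",
                     "expression": call("mstore", lit("0"), ident("x"))}
                ],
            },
        },
    ],
}

if __name__ == "__main__":
    for triplet in extract_triplets(EXAMPLE_AST, []):
        print(triplet)
```

As in Listing 2, the condition of an `if` contributes only its argument kind (`function` here), so no separate triplet is emitted for the `lt` call in this example.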
4.3. Context Awareness
4.4. Aggregated Yul Code Vectorization
- Variable definitions: A variable declaration without an assigned value does not contribute to the aggregated vector.
- Assignments: As described in the previous subsection, assignments do not produce a direct vectorized representation. Instead, the vector of the assigned expression is computed and stored in memory (referred to as the context map) for future use, but the assignment operation itself results in a zero vector.
- Function definitions: Defining a function does not contribute directly to the final aggregated vector, as function definitions return a zero vector. However, the function’s vector representation is computed and stored in the context map, ensuring that whenever the function is invoked, its computed vector is retrieved and incorporated into subsequent expressions.
Listing 3. Aggregated Yul code vectorization.

    # [] denotes the zero vector; ow, tw and aw are the opcode, type and argument weights.
    function vectorize_arguments(ast):
        final_vector = []
        for arg in ast:
            if arg["nodeType"] eq "YulFunctionCall":
                final_vector += vectorize_function_call(arg)
            else if arg["nodeType"] eq "YulIdentifier":
                final_vector += context[arg["name"]] || embeddings["variable"]
            else if arg["nodeType"] eq "YulLiteral":
                final_vector += embeddings["constant"]
        return final_vector

    function vectorize_function_call(ast):
        op_code = ast["functionName"]["name"]
        if op_code in EVM_OPCODES:
            op_code_type = EVM_OPCODES[op_code]
            op_code_vec = embeddings[op_code] || embeddings["functioncall"]
        else:
            op_code_type = "unknownTy"
            if op_code not in context["functions"]:
                op_code_vec = embeddings["functioncall"]
            else:
                if context["functions"][op_code]["calculated"]:
                    op_code_vec = context["functions"][op_code]["vector"]
                else:
                    # vectorize the function body on first use and cache the result
                    op_code_vec = vectorize_ast(context["functions"][op_code]["code"])
                    context["functions"][op_code]["calculated"] = true
                    context["functions"][op_code]["vector"] = op_code_vec
        return op_code_vec * ow + embeddings[op_code_type] * tw + vectorize_arguments(ast["arguments"]) * aw

    function vectorize_ast(ast):
        nodeType = ast["nodeType"]
        match nodeType:
            case "YulObject":
                return vectorize_ast(ast["code"])
            case "YulCode":
                return vectorize_ast(ast["block"])
            case "YulBlock":
                final_vector = []
                for statement in ast["statements"]:
                    final_vector += vectorize_ast(statement)
                return final_vector
            case "YulFunctionCall":
                return vectorize_function_call(ast)
            case "YulExpressionStatement":
                return vectorize_ast(ast["expression"])
            case "YulVariableDeclaration" | "YulAssignment":
                # store the value vector in the context map; the statement itself is a zero vector
                var_vector = vectorize_ast(ast["value"])
                for var in ast["variableNames"]:
                    context[var["name"]] = var_vector
                return []
            case "YulFunctionDefinition":
                if ast["name"] not in context["functions"]:
                    context["functions"][ast["name"]] = {"calculated": false, "vector": [], "code": ast["body"]}
                return []
            case "YulIf":
                return embeddings["if"] * ow + embeddings["booleanTy"] * tw + vectorize_ast(ast["condition"]) * aw + vectorize_ast(ast["body"])
            case "YulForLoop":
                final_vector = []
                final_vector += embeddings["forloop"] * ow
                final_vector += embeddings["voidTy"] * tw
                final_vector += vectorize_ast(ast["pre"])
                final_vector += vectorize_ast(ast["condition"]) * aw
                final_vector += vectorize_ast(ast["post"])
                final_vector += vectorize_ast(ast["body"])
                return final_vector
            case "YulSwitch":
                final_vector = []
                final_vector += embeddings["switch"] * ow
                final_vector += embeddings["voidTy"] * tw
                final_vector += vectorize_ast(ast["expression"]) * aw
                for switch_case in ast["cases"]:
                    final_vector += vectorize_ast(switch_case["body"])
                return final_vector
            case "YulBreak":
                return embeddings["break"] * ow + embeddings["voidTy"] * tw
            case "YulContinue":
                return embeddings["continue"] * ow + embeddings["voidTy"] * tw
            case "YulLeave":
                return embeddings["leave"] * ow + embeddings["voidTy"] * tw
            case "YulIdentifier":
                return context[ast["name"]] || embeddings["variable"]
            case "YulLiteral":
                return embeddings["constant"]

    for yul_code in yul_dataset:
        ast = parse_ast(yul_code)
        vectorize_ast(ast)
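The weighted aggregation and the role of the context map can be checked by hand on a two-statement program. The sketch below is a minimal, runnable Python version restricted to blocks, expression statements, declarations and assignments, identifiers, literals, and opcode calls; user-defined functions and control flow are omitted. The two-dimensional embeddings, the weights `OW`, `TW`, `AW`, and the two-entry `EVM_OPCODES` table are illustrative assumptions, not values from the paper, and the AST helper functions are hypothetical.

```python
# Toy weights and 2-dimensional embeddings: all values are illustrative assumptions.
OW, TW, AW = 1.0, 0.5, 0.25
DIM = 2
EMBEDDINGS = {
    "add": [4.0, 0.0], "mstore": [0.0, 4.0], "functioncall": [2.0, 2.0],
    "numberTy": [2.0, 0.0], "voidTy": [0.0, 2.0], "unknownTy": [2.0, 2.0],
    "variable": [4.0, 4.0], "constant": [8.0, 0.0],
}
EVM_OPCODES = {"add": "numberTy", "mstore": "voidTy"}


def zeros():
    return [0.0] * DIM


def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]


def vec_scale(u, k):
    return [a * k for a in u]


def vectorize(node, context):
    """Return the aggregated vector of `node`, updating the context map as a side effect."""
    kind = node["nodeType"]
    if kind == "YulBlock":
        total = zeros()
        for statement in node["statements"]:
            total = vec_add(total, vectorize(statement, context))
        return total
    if kind == "YulExpressionStatement":
        return vectorize(node["expression"], context)
    if kind in ("YulVariableDeclaration", "YulAssignment"):
        # The value vector is remembered for later uses; the statement itself is a zero vector.
        if node.get("value") is not None:
            value_vec = vectorize(node["value"], context)
            for var in node["variableNames"]:
                context[var["name"]] = value_vec
        return zeros()
    if kind == "YulIdentifier":
        return context.get(node["name"], EMBEDDINGS["variable"])
    if kind == "YulLiteral":
        return EMBEDDINGS["constant"]
    if kind == "YulFunctionCall":
        name = node["functionName"]["name"]
        op_type = EVM_OPCODES.get(name, "unknownTy")
        op_vec = EMBEDDINGS.get(name, EMBEDDINGS["functioncall"])
        args = zeros()
        for arg in node["arguments"]:
            args = vec_add(args, vectorize(arg, context))
        weighted = vec_add(vec_scale(op_vec, OW), vec_scale(EMBEDDINGS[op_type], TW))
        return vec_add(weighted, vec_scale(args, AW))
    raise ValueError(f"unsupported node type: {kind}")


# Hand-built AST helpers (illustrative only).
def call(name, *args):
    return {"nodeType": "YulFunctionCall", "functionName": {"name": name}, "arguments": list(args)}


def ident(name):
    return {"nodeType": "YulIdentifier", "name": name}


def lit(value):
    return {"nodeType": "YulLiteral", "value": value}


def let(name, value):
    return {"nodeType": "YulVariableDeclaration", "variableNames": [{"name": name}], "value": value}


def stmt(expr):
    return {"nodeType": "YulExpressionStatement", "expression": expr}


def block(*statements):
    return {"nodeType": "YulBlock", "statements": list(statements)}


# AST for: { let x := add(1, 2)  mstore(0, x) }
PROGRAM = block(let("x", call("add", lit("1"), lit("2"))), stmt(call("mstore", lit("0"), ident("x"))))

if __name__ == "__main__":
    context = {}
    print(vectorize(PROGRAM, context), context)
```

In this toy setting the declaration of `x` contributes nothing directly, but the vector of `add(1, 2)` stored under `x` changes the vector of the later `mstore(0, x)` compared with the case where `x` is unknown and the generic `variable` embedding is used.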
5. Experimental Results
5.1. Yul Entities Embeddings
5.2. Yul Vector Distribution
- Automated Categorization of 2,000 Samples: Embeddings were computed for the first 2,000 entries from the dataset and automatically assigned to one of five categories:
  - Initial Code (red): Yul code produced as the constructor portion during Solidity compilation. Such code typically includes boilerplate logic for deploying contract bytecode onto the blockchain.
  - Libraries (blue): Well-known and commonly used libraries, such as Strings or Base64.
  - ERC-20 Contracts (green): Contracts implementing the ERC-20 standard for fungible tokens.
  - ERC-721 Contracts (yellow): Contracts implementing the ERC-721 standard for non-fungible tokens.
  - Other (purple): Contracts not falling into any of the above categories.

  Classification into the first two categories is highly reliable: initial code files typically have the suffix "initial" in their filenames, while library files often use the library name as a prefix. For ERC-20 and ERC-721 contracts, classification was based on the presence of relevant keywords (ERC20, ERC721) in the code. Although this heuristic is not perfect, it is sufficiently accurate for identifying general distributional trends.
- Manual Evaluation on Selected Contracts: A small set of contracts was manually selected, with known categories based on prior knowledge. These samples were used to validate the embedding-based grouping from a qualitative standpoint.
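The filename and keyword heuristics used for the automated categorization can be sketched as follows. This is an illustration under stated assumptions rather than the script used in the experiments: the function name `categorize`, the returned labels, the example filenames, the precedence of the ERC20 check over the ERC721 check, and the two-entry library list (taken from the examples named in the text) are all hypothetical.

```python
# Library names mentioned in the text; a real list would presumably be longer.
KNOWN_LIBRARIES = ("Strings", "Base64")


def categorize(filename: str, code: str) -> str:
    """Assign a Yul file to one of the five categories using simple heuristics."""
    stem = filename.rsplit(".", 1)[0]  # drop the file extension, if any
    if stem.endswith("initial"):
        return "initial"
    if stem.startswith(KNOWN_LIBRARIES):
        return "library"
    # Keyword checks on the code itself; the order of these two checks is an assumption.
    if "ERC20" in code:
        return "erc20"
    if "ERC721" in code:
        return "erc721"
    return "other"


if __name__ == "__main__":
    samples = [
        ("MyToken_initial.yul", "object \"MyToken\" { }"),
        ("Base64.yul", "function encode() { }"),
        ("Coin.yul", "function fun_ERC20_transfer() { }"),
        ("Art.yul", "function fun_ERC721_mint() { }"),
        ("Vault.yul", "sstore(0, 1)"),
    ]
    for name, code in samples:
        print(name, "->", categorize(name, code))
```

Because the filename rules are checked first, a constructor file is labelled as initial code even when its body mentions ERC20, which matches the higher reliability attributed to the first two categories.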
6. Summary and Future Work
Data Availability Statement
Acknowledgments
Conflicts of Interest
1. The default optimization steps chosen by the Solidity team can be found at https://github.com/ethereum/solidity/blob/develop/libsolidity/interface/OptimiserSettings.h#L42
2. The report from this research can be found at https://github.com/ethereum/solidity/issues/7806







| Opcode Number | Opcode Name | Minimum Gas | Description |
|---|---|---|---|
| 0x00 | STOP | 0 | Halts execution |
| 0x01 | ADD | 3 | Addition operation |
| 0x02 | MUL | 5 | Multiplication operation |
| 0x03 | SUB | 3 | Subtraction operation |
| 0x04 | DIV | 5 | Integer division operation |
| 0x06 | MOD | 5 | Modulo operation |
| 0x0a | EXP | 10 | Exponential operation |
| 0x10 | LT | 3 | Less-than comparison |
| 0x11 | GT | 3 | Greater-than comparison |
| 0x12 | SLT | 3 | Signed less-than comparison |
| 0x13 | SGT | 3 | Signed greater-than comparison |
| 0x14 | EQ | 3 | Equality comparison |
| 0x15 | ISZERO | 3 | Simple not operator |
| 0x16 | AND | 3 | Bitwise AND operation |
| 0x17 | OR | 3 | Bitwise OR operation |
| 0x18 | XOR | 3 | Bitwise XOR operation |
| 0x19 | NOT | 3 | Bitwise NOT operation |
| 0x1a | BYTE | 3 | Retrieve single byte from word |
| 0x30 | ADDRESS | 2 | Get address of currently executing account |
| 0x31 | BALANCE | 100 | Get balance of the given account |
| 0x32 | ORIGIN | 2 | Get execution origination address |
| 0x33 | CALLER | 2 | Get caller address |
| 0x36 | CALLDATASIZE | 2 | Get size of input data |
| 0x37 | CALLDATACOPY | 3 | Copy input data to memory |
| 0x3a | GASPRICE | 2 | Get price of gas in current environment |
| 0x3b | EXTCODESIZE | 100 | Get size of an account’s code |
| 0x3c | EXTCODECOPY | 100 | Copy an account’s code to memory |
| 0x3f | EXTCODEHASH | 100 | Get hash of an account’s code |
| 0x41 | COINBASE | 2 | Get the block’s beneficiary address |
| 0x42 | TIMESTAMP | 2 | Get the block’s timestamp |
| 0x43 | NUMBER | 2 | Get the block’s number |
| 0x45 | GASLIMIT | 2 | Get the block’s gas limit |
| 0x50 | POP | 2 | Remove item from stack |
| 0x51 | MLOAD | 3 | Load word from memory |
| 0x52 | MSTORE | 3 | Save word to memory |
| 0x53 | MSTORE8 | 3 | Save byte to memory |
| 0x54 | SLOAD | 100 | Load word from storage |
| 0x55 | SSTORE | 100 | Save word to storage |
| 0x56 | JUMP | 8 | Alter the program counter |
| 0x57 | JUMPI | 10 | Conditionally alter the program counter |
| Opcode Entities | Type Entities | Argument Entities |
|---|---|---|
| break | addressTy | function |
| forloop | booleanTy | variable |
| functioncall | numberTy | constant |
| if | voidTy | |
| leave | unknownTy | |
| return | | |
| switch | | |
| … | | |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).