Augmenting RSDF output from SAP BW

7 min readMay 27, 2021

This is a bit of a niche post, forgive me. But it does show how a knowledge of open source techniques (particularly Python) can get you out of many a sticky situation.

I recently helped implement SAP BW/4HANA at a local subsidiary of a Tier 1 global retailer. This was completed with the assistance of a 3rd Party testing team. They were not an SAP shop and had high expectations for how the documentation would be completed as an input to their testing activities.

We decided to use the self-generated documentation in SAP BW/4HANA to document our data flows and the object definitions via transaction code RSDF. This is a tremendous way to extract detailed, system-generated documentation at the click of a few buttons.

However we discovered that while the RSDF functionality does a good job of documenting the details of most of SAP BW system, the BW Transformation rules, especially Formula Routines, falls short. The Transformation rules for 1:1 mappings, constants and ABAP routines are fine, but the Formula details are not populated.

Not having a 100% complete record meant that we could not rely on the RSDF-generated documentation. This in turn meant we would have to create the system documentation manually at tremendous cost of effort and time. The small omission of Formula details in the RSDF documentation had put our go-live date in jeopardy.

That’s when I spent a little time brainstorming a solution. Necessity being the mother of invention we noticed that the RSDF output was in the form of an HTML document, and it struck me that we could fix this little challenge by using python screen-scraping techniques. If we could take the HTML document as an input and discover what detail was missing, we could potentially collect the missing detail through an SQL call to the BW database and then inject it into the document to fill the gap.

It worked!

Here is a description of the script we used (rough & ready, and ripe for refinement … ) — Note the full script is available at the end of the article:

1. Import Libraries:

import os
import re
import pyhdb
import pandas as pd
import xxx_dbconfig as config
from lxml import html
from bs4 import BeautifulSoup

The xxx_dbconfig reference is my connection credentials that are stored in a separate file with the following content:

hanadb = {'host': "xxx.xxx.xxx.xxx",
         'port': *****,
         'user': "********",
         'passwd': "*********"}

2. Function to build list of HTML files

def get_inputhtml(base_path, print_on):
    input_path = base_path + '/input_html/'
    folder = os.fsencode(input_path)
    filenames = []
    os.chdir(input_path)
    # loop through 'input_html' folder for all HTML files
    for file in os.listdir(folder):
        filename = os.fsdecode(file)
        if filename.endswith( ('.HTML') ):
            filenames.append(filename)
    filenames.sort()
    return(filenames)

Here we look in the sub directory ‘/input_html/’ for html files and return a list of files for further processing.

3. Function to extract the soup

def get_soup(web_template, base_path, print_on): #read an HTML file and construct the soup object
    input_path = base_path + '/input_html/'
    os.chdir(input_path)
    f = open(web_template)
    soup = BeautifulSoup(f, 'html5lib')
    f.close()
    if print_on == 'X':
        print('Finished getting soup ...')
    return(soup)

Here we take each file and extract the contents the form of a soup of HTML so that we can apply our search feature to find where the detail needs to be added.

4. Function to find Transformations in the soup

def get_transformations(soup, print_on): #scan soup object and collect unique transformations
    # find all transformations in html doc
    trfms = []
    for element in soup.find_all(text=re.compile('Transformation:')):
        parentTable = element.findParent('table')
        technm_cell = parentTable.find(text=re.compile('Technical Name:'))
        trfm = technm_cell.findNext('td')
        trfms.append(trfm.text)
    if print_on == 'X':
        print('Finished getting transformations ...')
    return(trfms)

Here we search our soup for the Transformation block, returned as a list .

5. Function to clean up the soup

def clean_soup(soup, print_on, rem_extra_detail): #clean up various links
    # remove extra detail ['Required Objects', 'ADSO Flags', 'Relays data to', 'Receives data from', 'Used by',  'Data Transfer Process'] 
    if rem_extra_detail == 'X':
        extra_detail_elements = ['Required Objects', 'ADSO Flags', 'Relays data to', 'Receives data from', 'Used by',  'Data Transfer Process'] 
        for extra_detail in extra_detail_elements:
            for block in soup.find_all(text=re.compile(extra_detail)):
                if extra_detail == 'Data Transfer Process':
                    parentTable = block.findParent('table')
                    parentTable.decompose()
                else:
                    parentTbody = block.findParent('tbody')
                    parentTbody.decompose()
            if print_on == 'X':
                print('Removed blocks containing: ' + extra_detail)
    # remove broken links from the first 'Reqired Objects' table (if not already removed above)
    indx = 0
    for broken_links in soup.find_all(text=re.compile('Required Objects')):
        if indx == 0:
            bl_parentTbody = broken_links.findParent('tbody')
            for bl_match in bl_parentTbody.find_all('a'):
                bl_match.replaceWithChildren()
        indx += 1
    # remove spans that contain 'FORMULA'
    for formula_spans in soup.find_all(text=re.compile('Formula:')):
        parentSpan = formula_spans.findParent('span')
        parentSpan.decompose()
    # repair/reformat FORMULA links
    invalid_tags = ['a', 'b', 'u']
    for tag in invalid_tags:
        for formula_links in soup.find_all(text=re.compile('FORMULA')):
            parentTd = formula_links.findParent('td')
            for match in parentTd.findAll(tag):
                match.replaceWithChildren()
    if print_on == 'X':
        print('Cleaned the soup ...')
    return(soup)

Here we take an extract of soup and clean up known issues with the way RSDF formats the HTML.

6. Function to retrive the missing detail from SAP BW

def get_data(trfms, print_on): #get active_formulas and active_tranfields
    # load data from HANA db
    connection = pyhdb.connect(host=config.hanadb['host'], 
                               port=config.hanadb['port'],
                               user=config.hanadb['user'], 
                               password=config.hanadb['passwd'])
    hana_list = str(trfms).replace('[', '(').replace(']', ')')
    cursor = connection.cursor()
    #Deliberately not relying on a single view to avoid HANA dependencies
    #Get active formulas for all transformations in the document
    where_clause = '(("TRANID" in ' + hana_list + ') AND ("OBJVERS" = \'A\') AND ( "KIND" = \'FORMULA\') )'
    stmnt = 'SELECT * from SAPBWH.RSTRANSTEPROUT WHERE ' + where_clause
    cursor.execute(stmnt)
    active_formulas = cursor.fetchall()
    #Get ibjs for rule steps above
    where_clause = '(("OBJVERS" = \'A\') AND ("TRANID" in ' + hana_list + ') AND ("PARAMTYPE" = \'1\'))'
    stmnt = 'SELECT * FROM "SAPBWH"."RSTRANFIELD" WHERE ' + where_clause 
    cursor.execute(stmnt)
    active_tranfields = cursor.fetchall()
    #Get rule step texts (used for formulas)
    where_clause = '( ("LANGU" = \'EN\') AND ("TRANID" in ' + hana_list + ') AND ("OBJVERS" = \'A\') )'
    stmnt = 'SELECT * FROM "SAPBWH"."RSTRANRULET" WHERE ' + where_clause 
    cursor.execute(stmnt)
    active_trantexts = cursor.fetchall()
    cursor.close()
    
    #!!*** todo: get Descriptions ***!!
    #for tran_step in active_formulas:
    #active_formulas_wtxts = [ft for ft in active_trantexts if ft[4] == tran_step[]]
    #!!*** ******************* ***!!
    
    if print_on == 'X':
        print('Retrieved data from HANA ...')
    return(active_formulas, active_tranfields)

Here go to the SAP BW database and look-up the missing detail for the Transformations that contain Formulas.

7. Function to inject the Formula detail into the HTML

def inject_formulas(trfms, soup, active_formulas, active_tranfields, print_on):
    # loop over transformations and link formula code, and insert
    for trfmation in trfms:
        if print_on == 'X':
            print('++++++++++++++++++++')
            print('Trfmtn: ' + trfmation)
        formula_steps = [sfs for sfs in active_formulas if sfs[0] == trfmation]
        for single_formula_step in formula_steps:
            #formula_descr = 'Description: ' + 
            formula_cd = 'Formula: ' + str(single_formula_step[7])
            single_trfm_rules = [y for y in active_tranfields if y[0] == trfmation]
            single_step = [y for y in single_trfm_rules if y[3] == single_formula_step[2]]
            iobj_id = single_step[0][6]
            if print_on == 'X':
                print('IObj: ' + iobj_id)
                print(formula_cd)
                print('____________________')
            span_txt = formula_cd
            span_id = 'formula_' + iobj_id
            new_span = soup.new_tag('div', id=span_id)      
            new_span.string = str(span_txt)
            new_br = soup.new_tag('br')
            new_span.insert(0, new_br)
            for element2 in soup.find_all(text=re.compile('Transformation:')):
                parentTable2 = element2.findParent('table')
                curr_trfm_cell = parentTable2.find(text=re.compile('Technical Name:'))
                curr_trfm_id = curr_trfm_cell.findNext('td').text
                if curr_trfm_id == trfmation:
                    group_table = parentTable2.find(text=re.compile('Group')).find_parent('table')
                    iobj_element = group_table.find(text=re.compile(str(iobj_id)))
                    parentTbody = iobj_element.find_parent('tbody')
                    parentTbody.find_all("span")[-2].append(new_span)
    if print_on == 'X':
        print('Finished injecting Formulas ...')
    return(soup)

This function loops through the list of relevant transformations (those with missing Formula definitions) and add in the detail that was collected from the SAP BW database.

8. Function to write the corrected HTML to file

def write_output(soup, base_path, print_on):
    # output modified html
    html_output_path = base_path + '/output_html/'
    os.chdir(html_output_path)
    html = soup.prettify("latin-1")
    input_fileName,fileType=web_template.split('.')
    output_filename = input_fileName + "_output." + fileType
    if print_on == 'X':
        print('Writing file: ' + output_filename)
    with open(output_filename, "wb") as file:
        file.write(html)
    if print_on == 'X':
        print('Finished writing file: ' + output_filename)

Here we take the corrected HTML and write it back out to file.

9. Function to convert HTML files to PDFs

def convert_pdf(base_path, print_on):
    #import libraries
    import pdfkit
    filenames = []
    # pdf_input_path is the output path of the HTML documents
    pdf_input_path = base_path + '/output_html/'
    folder = os.fsencode(pdf_input_path)
    # loop through source folder for all HTML files
    for file in os.listdir(folder):
        filename = os.fsdecode(file)
        if filename.endswith( ('_output.HTML') ):
            filenames.append(filename)
    filenames.sort()
    if print_on == 'X':
        print (' ')
        print('Start of PDF conversion ...')
    # For every HTML file: produce PDF
    pdf_output_path = base_path + '/output_pdf/'
    folder = os.fsencode(pdf_output_path)
    for file in filenames:
        fileName,fileType=file.split('.')
        pdf_filename = fileName + "_" + fileType + ".pdf"
        if print_on == 'X':
            print ("-> Processing: " + file)
        pdfkit.from_file(pdf_input_path + file, pdf_output_path + pdf_filename)
        if print_on == 'X':
            print ("... copied to: " + pdf_filename)
            print(' ')
    if print_on == 'X':
        print (' ')
        print('End of PDF conversion ...')
        print('____________________')

10. Main script block

#---***---   Main Code Block  ---***---
# start of flags section for controlling execution
rem_extra_detail = 'X' #superfluous detail flag
print_on = 'X'         #print messages flag
pdf_on = 'X'           #conversion to PDF flag
# end of flags section for controlling execution# get local path
base_path = os.getcwd()# get list of HTML files to process
filenames = get_inputhtml(base_path, print_on)# loop through relevant files 
for web_template in filenames:
    # control messaging
    if print_on == 'X':
        print(' ')
        print('--> file: ' + web_template)
        print(' ')
    
    # convert html to BeautifulSoup object
    soup = get_soup(web_template, base_path, print_on)
    
    # extract transformations from the soup
    transformations = get_transformations(soup, print_on)
    
    # clean-up the soup, removing detail as required
    soup_mod1 = clean_soup(soup, print_on, rem_extra_detail)
    
    # query SAP HANA to return Formula details
    active_formulas, active_tranfields = get_data(transformations, print_on)
    
    # construct and then inject 'formula' HTML and write to output HTML
    soup_mod2 = inject_formulas(transformations, soup_mod1, active_formulas, active_tranfields, print_on)
    write_output(soup_mod2, base_path, print_on)
    
# if required, output to PDF
if pdf_on == 'X':
    convert_pdf(base_path, print_on)print(' ... and that\'s a wrap!')

The main block processes the data in the following order:

Get the list if HTML files
For each file do the following:

Convert html to BeautifulSoup object
Extract transformations from the soup
Clean-up the soup, removing detail as required
Query SAP HANA to return Formula details
Construct and then inject ‘formula’ HTML and write to output HTML
Finally: if required, output to PDF

Conclusion

And there you go. I hope this is useful to someone out there. Please let me know if you do use this on your own implementation.